Awesome open data-centric AI
Open source tooling for data-centric AI on unstructured data
- Renumics Spotlight: Curation tool for unstructured data that connects your stack to the data-centric AI ecosystem.
- Argilla: Helps domain experts and data teams build better NLP datasets in less time.
Exploratory data analysis (EDA)
| Name | Data type | Description | Notebook |
| --- | --- | --- | --- |
| Understand distributions | image | Use the Hugging Face transformers library to compute image embeddings and explore the dataset based on the similarity map and additional metadata. | |
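
A similarity map is a 2D projection of the embeddings in which similar items end up close together. The following is a minimal sketch of that idea, assuming the embeddings have already been computed (for example with a Hugging Face model). It uses a plain PCA via power iteration as a stand-in projection; the notebook may well use a different method, and the function names here are hypothetical.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def center(rows):
    """Subtract the per-dimension mean so variance is measured around the centroid."""
    dim = len(rows[0])
    means = [sum(r[j] for r in rows) / len(rows) for j in range(dim)]
    return [[r[j] - means[j] for j in range(dim)] for r in rows]


def top_component(rows, iterations=200, seed=0):
    """Approximate the leading principal direction with power iteration."""
    rng = random.Random(seed)
    v = [rng.uniform(-1.0, 1.0) for _ in range(len(rows[0]))]
    norm = math.sqrt(dot(v, v))
    v = [x / norm for x in v]
    for _ in range(iterations):
        scores = [dot(r, v) for r in rows]  # X v
        w = [sum(s * r[j] for s, r in zip(scores, rows)) for j in range(len(v))]  # X^T X v
        norm = math.sqrt(dot(w, w))
        if norm == 0.0:  # no variance left in the data
            break
        v = [x / norm for x in w]
    return v


def project_2d(embeddings):
    """Project embeddings onto their first two principal directions."""
    x = center(embeddings)
    pc1 = top_component(x)
    # Remove the first direction from the data, then find the next one.
    residual = [[a - dot(r, pc1) * p for a, p in zip(r, pc1)] for r in x]
    pc2 = top_component(residual, seed=1)
    return [(dot(r, pc1), dot(r, pc2)) for r in x]


# Toy embeddings (made up): two items per cluster.
embeddings = [
    [1.0, 0.0, 0.0, 0.0],
    [0.9, 0.1, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.1, 0.9, 0.0],
]
points = project_2d(embeddings)
```

The resulting 2D points can be drawn as a scatter plot and colored by metadata columns to see how the dataset is distributed.
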
Cleaning
| Name | Data type | Description | Notebook |
| --- | --- | --- | --- |
| Detect duplicates | agnostic | Use the Annoy library to detect nearest neighbors in the embedding space and inspect data points that are duplicates or near-duplicates. | |
| Detect outliers | agnostic | Use the Cleanlab library to compute outlier scores based on model output (embeddings, probabilities) and inspect outlier candidates. | |
| Detect image issues | image | Use the CleanVision library to extract typical image issues (brightness, blur, aspect ratio, SNR, and duplicates) and identify critical segments through manual inspection. | |
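
The duplicate detection recipe boils down to finding pairs of data points whose embeddings are almost identical. Here is a rough sketch of that idea using an exhaustive pairwise search instead of Annoy's approximate index, so it only suits small datasets. The function names and the threshold value are assumptions made for illustration.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity; zero vectors are treated as maximally distant."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0
    return 1.0 - sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


def find_near_duplicates(embeddings, threshold=0.01):
    """Return (i, j, distance) for every pair closer than the threshold, closest first."""
    pairs = []
    for i in range(len(embeddings)):
        for j in range(i + 1, len(embeddings)):
            distance = cosine_distance(embeddings[i], embeddings[j])
            if distance <= threshold:
                pairs.append((i, j, distance))
    return sorted(pairs, key=lambda pair: pair[2])


# Toy embeddings (made up): items 0 and 1 are nearly identical.
embeddings = [[1.0, 0.0], [1.0, 0.001], [0.0, 1.0]]
duplicates = find_near_duplicates(embeddings)
```

The returned pairs are only candidates; the threshold depends on the embedding model, so the pairs still need a manual look.
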
Annotation
| Name | Data type | Description | Notebook |
| --- | --- | --- | --- |
| Find label inconsistencies | agnostic | Use the Cleanlab library to compute label error flags based on model probabilities and manually inspect critical data segments. | |
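
To give a feel for how label error flags can be derived from model probabilities, here is a simplified sketch loosely inspired by the confident learning idea. It is not Cleanlab's implementation or API: each class gets a threshold equal to the mean probability the model assigns to it on examples carrying that label, and an example is flagged when the model is confident about a class other than the given one. All names are hypothetical.

```python
def class_thresholds(labels, pred_probs, num_classes):
    """Per-class threshold: mean predicted probability of class k on examples labeled k."""
    sums = [0.0] * num_classes
    counts = [0] * num_classes
    for label, probs in zip(labels, pred_probs):
        sums[label] += probs[label]
        counts[label] += 1
    # Classes without examples get a threshold of 1.0 so they are rarely "confident".
    return [sums[k] / counts[k] if counts[k] else 1.0 for k in range(num_classes)]


def flag_label_issues(labels, pred_probs):
    """Return indices whose most confident class disagrees with the given label."""
    num_classes = len(pred_probs[0])
    thresholds = class_thresholds(labels, pred_probs, num_classes)
    flagged = []
    for index, (label, probs) in enumerate(zip(labels, pred_probs)):
        confident = [k for k in range(num_classes) if probs[k] >= thresholds[k]]
        if confident:
            best = max(confident, key=lambda k: probs[k])
            if best != label:
                flagged.append(index)
    return flagged


# Toy data (made up): example 2 is labeled 0 but the model strongly predicts 1.
labels = [0, 0, 0, 1, 1, 1]
pred_probs = [
    [0.9, 0.1],
    [0.8, 0.2],
    [0.1, 0.9],
    [0.2, 0.8],
    [0.1, 0.9],
    [0.15, 0.85],
]
issues = flag_label_issues(labels, pred_probs)
```

In practice the probabilities should come from out-of-sample predictions (for example via cross-validation), otherwise the model may simply have memorized the wrong labels.
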
Modeling
| Name | Data type | Description | Notebook |
| --- | --- | --- | --- |
| Detect leakage | agnostic | Use nearest neighbor distances to identify candidates for data leakage and manually inspect them. | |
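
One way to read the leakage recipe: for every test sample, look up its nearest neighbor in the training set and flag it if the two are suspiciously close. The sketch below assumes precomputed embeddings, Euclidean distance, and a non-empty training set; the function name and threshold are illustrative choices, not taken from the notebook.

```python
import math


def leakage_candidates(train_embeddings, test_embeddings, threshold=1e-3):
    """Return (test_index, train_index, distance) for test items that nearly match a training item."""
    candidates = []
    for test_index, test_vector in enumerate(test_embeddings):
        # Exhaustive nearest neighbor search over the training set.
        distance, train_index = min(
            (math.dist(test_vector, train_vector), index)
            for index, train_vector in enumerate(train_embeddings)
        )
        if distance <= threshold:
            candidates.append((test_index, train_index, distance))
    return candidates


# Toy data (made up): the first test item also appears in the training set.
train = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]
test = [[1.0, 1.0], [3.0, 3.0]]
leaks = leakage_candidates(train, test)
```

A flagged pair is not proof of leakage. It is a shortlist for manual inspection, for example to spot the same image saved twice or frames from the same recording ending up in both splits.
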
Validation
| Name | Data type | Description | Notebook |
| --- | --- | --- | --- |
| Inspect decision boundaries | agnostic | Compute a decision boundary score based on certainty ratios and inspect the results in a scatter plot. | |
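
The exact definition of the certainty ratio lives in the notebook. As one plausible interpretation (an assumption, not the notebook's formula), the sketch below scores each sample by the ratio of the second-highest to the highest class probability: values near 1 mean the model is torn between two classes, values near 0 mean it is far from the boundary.

```python
def boundary_score(probs):
    """Ratio of runner-up to top class probability, in [0, 1]."""
    if len(probs) < 2:
        raise ValueError("need probabilities for at least two classes")
    top, second = sorted(probs, reverse=True)[:2]
    return second / top if top > 0 else 0.0


def rank_by_boundary(pred_probs):
    """Return sample indices ordered from closest to farthest from the decision boundary."""
    scores = [boundary_score(probs) for probs in pred_probs]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


# Toy probabilities (made up) for three samples and two classes.
pred_probs = [[0.9, 0.1], [0.55, 0.45], [0.99, 0.01]]
ranking = rank_by_boundary(pred_probs)
```

The score can then be used to color the points of a scatter plot so that samples near the boundary stand out.
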
Monitoring
| Name | Data type | Description | Notebook |
| --- | --- | --- | --- |
| Detect data drift | agnostic | Compute the cosine distance to the k-th nearest neighbor in the embedding space, use it as the drift distance, and inspect critical segments. | |
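
The drift distance described above can be sketched as follows, assuming a set of reference embeddings (for example from the training data) and new embeddings from production. The function names and the choice of `k` are hypothetical; the search is exhaustive, so a real setup would likely use a nearest neighbor index.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity; zero vectors are treated as maximally distant."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0
    return 1.0 - sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


def drift_distance(sample, reference, k=1):
    """Cosine distance from the sample to its k-th nearest neighbor in the reference set."""
    if not 1 <= k <= len(reference):
        raise ValueError("k must be between 1 and the size of the reference set")
    distances = sorted(cosine_distance(sample, ref) for ref in reference)
    return distances[k - 1]


# Toy embeddings (made up): the reference data points roughly along the first axis.
reference = [[1.0, 0.0], [0.9, 0.1], [1.0, 0.1]]
in_distribution = drift_distance([1.0, 0.05], reference, k=2)
drifted = drift_distance([0.0, 1.0], reference, k=2)
```

Samples with a large drift distance have few similar reference points, which makes them good candidates for inspection. Using `k > 1` makes the score less sensitive to a single stray reference point.
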
Further reading
To keep this list focused and to avoid duplicate work, we excluded some topics. Read more about them here:
- DCAI tools for tabular data. The YData team maintains an awesome list for that.
- Labeling tools. Although labeling is part of the DCAI workflow, we refer to the awesome list of the ZenML team on that topic.
- MLOps tooling. We exclude all topics that are clearly out of the DCAI scope and refer to established MLOps awesome lists for these tools.
- Research papers. We focus on industry-ready open source tooling. Check out this list for a research-oriented view on DCAI.