dataset-curation
Dataset Curation Methodology
You are helping a researcher curate, analyze, or expand a dataset with attention to bias, fairness, and quality.
Step 1: Distribution Analysis
Before taking any curation action, understand the dataset's current state:
Per-Class Distribution
- Count instances per class/label/tag
- Compute imbalance ratio (max_count / min_count)
- Identify severely underrepresented classes (count below 5% of the largest class's count)
- Visualize: bar chart of class frequencies sorted by count
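The per-class checks above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not a prescribed implementation: the function names `class_distribution` and `text_bar_chart`, the `threshold` parameter, and the toy labels are all assumptions made for the example, and the text-based chart merely stands in for whatever plotting tool you prefer.

```python
from collections import Counter


def class_distribution(labels, threshold=0.05):
    """Summarize per-class counts, imbalance ratio, and underrepresented classes.

    `threshold` is the fraction of the largest class's count below which
    a class is flagged (0.05 mirrors the 5% rule of thumb above).
    """
    counts = Counter(labels)
    if not counts:
        raise ValueError("labels is empty")
    max_count = max(counts.values())
    min_count = min(counts.values())
    underrepresented = sorted(
        cls for cls, n in counts.items() if n < threshold * max_count
    )
    return {
        "counts": counts.most_common(),  # (class, count) pairs, largest first
        "imbalance_ratio": max_count / min_count,
        "underrepresented": underrepresented,
    }


def text_bar_chart(sorted_counts, width=40):
    """Render a plain-text bar chart from (class, count) pairs sorted by count."""
    max_count = sorted_counts[0][1]
    lines = []
    for cls, n in sorted_counts:
        bar = "#" * max(1, round(width * n / max_count))
        lines.append(f"{cls:<12} {n:>6} {bar}")
    return "\n".join(lines)


# Hypothetical toy labels, for illustration only.
labels = ["cat"] * 100 + ["dog"] * 40 + ["bird"] * 4
summary = class_distribution(labels)
print(text_bar_chart(summary["counts"]))
print("imbalance ratio:", summary["imbalance_ratio"])
print("underrepresented:", summary["underrepresented"])
```

For multi-label data, flatten each instance's tags into one list before calling `class_distribution`, so that every tag occurrence is counted.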
Co-occurrence Analysis
- Build a co-occurrence matrix showing which labels appear together
- Identify spurious correlations (e.g., "violence" always co-occurs with "male")
- Check for label leakage between splits
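As a rough sketch of the co-occurrence step, the snippet below counts label pairs across multi-label samples and flags pairs where one label almost always accompanies another. The helper names, the `min_rate` and `min_support` cutoffs, and the toy samples are assumptions chosen for illustration. A flagged pair is only a candidate for review, since a high conditional rate may reflect a genuine relationship rather than a spurious one.

```python
from collections import Counter
from itertools import combinations


def cooccurrence(samples):
    """Count how often each label and each unordered label pair appears.

    `samples` is an iterable of label collections, one per instance.
    """
    label_counts = Counter()
    pair_counts = Counter()
    for tags in samples:
        unique = sorted(set(tags))  # dedupe and fix pair ordering
        label_counts.update(unique)
        pair_counts.update(combinations(unique, 2))
    return label_counts, pair_counts


def near_deterministic_pairs(label_counts, pair_counts,
                             min_rate=0.95, min_support=3):
    """Return (a, b, rate) triples where b appears in at least `min_rate`
    of the instances tagged a, and a occurs at least `min_support` times."""
    flagged = []
    for (x, y), n in pair_counts.items():
        for a, b in ((x, y), (y, x)):
            rate = n / label_counts[a]
            if label_counts[a] >= min_support and rate >= min_rate:
                flagged.append((a, b, rate))
    return sorted(flagged)


# Hypothetical toy samples, for illustration only.
samples = [
    {"violence", "male"},
    {"violence", "male", "outdoor"},
    {"violence", "male"},
    {"male", "indoor"},
    {"female", "indoor"},
    {"female", "outdoor"},
]
label_counts, pair_counts = cooccurrence(samples)
flagged = near_deterministic_pairs(label_counts, pair_counts)
for a, b, rate in flagged:
    print(f"'{a}' co-occurs with '{b}' in {rate:.0%} of its instances")
```

The `min_support` cutoff keeps very rare labels from being flagged on the strength of one or two instances; tune both cutoffs to the size of your dataset.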