dataset-curation
Dataset Curation Methodology
You are helping a researcher curate, analyze, or expand a dataset with attention to bias, fairness, and quality.
Step 1: Distribution Analysis
Before any curation action, understand the current state:
Per-Class Distribution
- Count instances per class/label/tag
- Compute imbalance ratio (max_count / min_count)
- Identify severely underrepresented classes (count below 5% of the largest class's count)
- Visualize: bar chart of class frequencies sorted by count
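The per-class checks above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not a prescribed implementation: the function name `class_distribution`, the `rare_fraction` parameter, and the toy labels are all hypothetical, and the text bar chart stands in for whatever plotting library the project already uses.

```python
from collections import Counter


def class_distribution(labels, rare_fraction=0.05):
    """Return per-class counts, the imbalance ratio, and rare classes.

    rare_fraction is a hypothetical parameter mirroring the 5% rule of
    thumb: a class is flagged when its count is below that fraction of
    the largest class's count.
    """
    counts = Counter(labels)
    max_count = max(counts.values())
    min_count = min(counts.values())
    imbalance_ratio = max_count / min_count
    rare = sorted(c for c, n in counts.items() if n < rare_fraction * max_count)
    return counts, imbalance_ratio, rare


# Toy single-label dataset (illustrative only).
labels = ["cat"] * 100 + ["dog"] * 40 + ["bird"] * 4
counts, ratio, rare = class_distribution(labels)

# Text bar chart, sorted by count (one '#' per 4 instances).
for name, n in counts.most_common():
    print(f"{name:>6} {n:>4} {'#' * (n // 4)}")

print("imbalance ratio:", ratio)
print("underrepresented:", rare)
```

For multi-label data, flatten each sample's tag list into `labels` before counting, so that a class's count is the number of samples carrying that tag.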
Co-occurrence Analysis
- Build co-occurrence matrix: which labels appear together
- Identify spurious correlations (e.g., "violence" always co-occurs with "male")
- Check for label leakage between splits
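A rough sketch of the co-occurrence checks, again in plain Python. Everything named here is an assumption made for illustration: the helper names, the `min_support` and `threshold` parameters, and the toy samples. The last helper reads "leakage between splits" as the same sample ID appearing in more than one split, which is only one possible interpretation.

```python
from collections import Counter
from itertools import combinations


def cooccurrence(samples):
    """Count how often each label and each unordered label pair appears."""
    label_counts = Counter()
    pair_counts = Counter()
    for tags in samples:
        unique = sorted(set(tags))  # sort so each pair has one canonical order
        label_counts.update(unique)
        pair_counts.update(combinations(unique, 2))
    return label_counts, pair_counts


def suspicious_pairs(label_counts, pair_counts, min_support=3, threshold=0.95):
    """Flag (a, b, rate) where b is present in at least `threshold` of a's samples.

    min_support and threshold are hypothetical knobs; tune them per dataset.
    """
    flagged = []
    for (a, b), n in pair_counts.items():
        for src, dst in ((a, b), (b, a)):
            rate = n / label_counts[src]
            if label_counts[src] >= min_support and rate >= threshold:
                flagged.append((src, dst, rate))
    return sorted(flagged)


def split_overlap(train_ids, test_ids):
    """Return sample IDs present in both splits (one reading of leakage)."""
    return sorted(set(train_ids) & set(test_ids))


# Toy multi-label samples (illustrative only).
samples = [
    ["violence", "male"],
    ["violence", "male"],
    ["violence", "male", "outdoor"],
    ["male", "outdoor"],
    ["female", "outdoor"],
    ["female"],
]
label_counts, pair_counts = cooccurrence(samples)
flagged = suspicious_pairs(label_counts, pair_counts)
print(flagged)
print(split_overlap(["img1", "img2", "img3"], ["img3", "img4"]))
```

A flagged pair is a prompt for manual review rather than proof of a spurious correlation: some labels co-occur for legitimate reasons, and small supports make the conditional rate noisy.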