dataset-curation

Dataset Curation Methodology

You are helping a researcher curate, analyze, or expand a dataset with attention to bias, fairness, and quality.

Step 1: Distribution Analysis

Before any curation action, understand the current state:

Per-Class Distribution

  • Count instances per class/label/tag
  • Compute imbalance ratio (max_count / min_count)
  • Identify severely underrepresented classes (count below 5% of the largest class's count)
  • Visualize: bar chart of class frequencies sorted by count
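
The per-class checks above can be sketched in a few lines of plain Python. This is a minimal illustration, not a prescribed implementation: the function names `class_distribution` and `text_bar_chart`, the `rare_threshold` parameter, and the sample labels are all hypothetical, and a text bar chart stands in for a plotting library.

```python
from collections import Counter


def class_distribution(labels, rare_threshold=0.05):
    """Summarize a flat list of labels (one label per instance).

    Returns (ordered, imbalance_ratio, rare) where `ordered` is a list of
    (label, count) pairs sorted by descending count, `imbalance_ratio` is
    max_count / min_count, and `rare` lists labels whose count is below
    `rare_threshold` times the largest class's count.
    """
    counts = Counter(labels)
    if not counts:
        raise ValueError("labels is empty")
    ordered = counts.most_common()  # sorted by count, largest first
    max_count = ordered[0][1]
    min_count = ordered[-1][1]
    imbalance_ratio = max_count / min_count
    rare = [label for label, n in ordered if n < rare_threshold * max_count]
    return ordered, imbalance_ratio, rare


def text_bar_chart(ordered, width=40):
    """Render class frequencies as text bars, largest class first."""
    max_count = ordered[0][1]
    name_width = max(len(str(label)) for label, _ in ordered)
    lines = []
    for label, n in ordered:
        bar = "#" * max(1, round(n / max_count * width))
        lines.append(f"{str(label):<{name_width}}  {bar} {n}")
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical labels, for illustration only.
    sample = ["cat"] * 100 + ["dog"] * 40 + ["bird"] * 4
    ordered, ratio, rare = class_distribution(sample)
    print(text_bar_chart(ordered))
    print(f"imbalance ratio: {ratio:.1f}")
    print(f"underrepresented: {rare}")
```

For multi-label data, flatten the per-item label lists before passing them in, so that each label occurrence is counted once.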

Co-occurrence Analysis

  • Build co-occurrence matrix: which labels appear together
  • Identify spurious correlations (e.g., "violence" always co-occurs with "male")
  • Check for label leakage between splits
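
A rough sketch of the co-occurrence checks, under stated assumptions: each item is a list of labels, a "spurious correlation" candidate is any pair where one label appears with another in nearly all of its occurrences, and split leakage is read here as the same item ID appearing in more than one split. The function names, the `threshold` and `min_support` parameters, and the sample data are hypothetical. Flagged pairs are candidates for manual review, not confirmed biases.

```python
from collections import Counter
from itertools import combinations


def cooccurrence(items):
    """Count label frequencies and unordered label-pair co-occurrences.

    `items` is an iterable of label lists, one list per instance.
    Pair keys are alphabetically sorted tuples, e.g. ("male", "violence").
    """
    label_counts = Counter()
    pair_counts = Counter()
    for tags in items:
        unique = sorted(set(tags))  # ignore duplicate labels on one item
        label_counts.update(unique)
        pair_counts.update(combinations(unique, 2))
    return label_counts, pair_counts


def near_deterministic_pairs(label_counts, pair_counts,
                             threshold=0.95, min_support=20):
    """Flag (src, dst, rate) where dst appears on >= threshold of src's items.

    `min_support` skips labels too rare for the rate to be meaningful.
    """
    flagged = []
    for (a, b), n in pair_counts.items():
        for src, dst in ((a, b), (b, a)):
            rate = n / label_counts[src]
            if label_counts[src] >= min_support and rate >= threshold:
                flagged.append((src, dst, rate))
    return sorted(flagged, key=lambda t: -t[2])


def split_overlap(train_ids, test_ids):
    """Return item IDs present in both splits (one reading of leakage)."""
    return set(train_ids) & set(test_ids)


if __name__ == "__main__":
    # Hypothetical items, for illustration only.
    items = ([["violence", "male"]] * 30
             + [["male", "sports"]] * 70
             + [["female", "sports"]] * 50)
    label_counts, pair_counts = cooccurrence(items)
    for src, dst, rate in near_deterministic_pairs(label_counts, pair_counts):
        print(f"{src!r} co-occurs with {dst!r} in {rate:.0%} of its items")
    print(split_overlap(["a", "b", "c"], ["c", "d"]))
```

The rate is directional on purpose: in the sample data "violence" always appears with "male", while "male" appears with "violence" in only 30% of its items, so only the first direction is flagged.
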
First Seen
Apr 20, 2026