data-cleaning


Data Cleaning

This skill enables an AI agent to systematically clean and preprocess raw datasets into analysis-ready form. The agent handles missing values, duplicate records, data type mismatches, inconsistent formats, outlier treatment, and normalization. It can also enforce validation schemas to ensure ongoing data quality. The primary toolchain is pandas with support from pyjanitor and great_expectations for advanced validation.

Workflow

  1. Ingest and profile the raw data. Load the dataset and immediately generate a quality report: count nulls per column, identify duplicate rows, check data types against expected schema, and flag columns with mixed types. This profile drives every subsequent cleaning decision.

  2. Handle missing values. Apply a per-column strategy based on data type and missingness pattern. For numeric columns with less than 5% missing, use median imputation. For categorical columns, use the mode or a dedicated "Unknown" category. For columns missing in more than 40% of rows, flag them for potential removal and consult the user before dropping.

  3. Remove duplicates and resolve conflicts. Identify exact duplicates and near-duplicates (e.g., rows differing only in whitespace or casing). For exact duplicates, keep the first occurrence. For near-duplicates, apply fuzzy matching with a configurable similarity threshold and merge conflicting values by recency or completeness.

  4. Correct data types and standardize formats. Coerce columns to their intended types — parse date strings into datetime objects, convert numeric strings to floats, and normalize categorical values to a canonical form. Standardize formats such as phone numbers, postal codes, and currency representations.

  5. Detect and treat outliers. Use the IQR method (fences at 1.5 × IQR beyond the quartiles) for skewed or unknown distributions and z-scores for approximately normal data. Offer three treatment options: cap at boundary values (winsorization), replace with null for later imputation, or flag-only mode that annotates but preserves original values.

  6. Validate the cleaned output. Run the cleaned dataset through validation rules — non-null constraints, range checks, uniqueness constraints, and referential integrity. Report any remaining violations and save the clean dataset alongside a cleaning log that documents every transformation applied.
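The profiling pass in step 1 can be sketched in pandas as follows. This is a minimal illustration, not the skill's actual implementation; the helper name `profile` and the sample columns are hypothetical.

```python
import pandas as pd

def profile(df: pd.DataFrame) -> pd.DataFrame:
    """Per-column quality report: dtype, null counts, and a mixed-type flag."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "null_count": df.isna().sum(),
        "null_pct": df.isna().mean().round(3),
        # an object column holding more than one Python type is "mixed"
        "mixed_types": [
            df[c].dropna().map(type).nunique() > 1 for c in df.columns
        ],
    })

df = pd.DataFrame({"a": [1, 2, None], "b": ["x", 2, "y"]})
rep = profile(df)
# exact duplicate rows, reported separately from the per-column table
dup_rows = int(df.duplicated().sum())
```

Comparing this report against the expected schema is what drives the decisions in the later steps.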
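Step 2's thresholds (median below 5% missing, "Unknown" for categoricals, flag above 40%) might be applied like this sketch; the function name `impute` and the sample columns are illustrative, not part of the skill.

```python
import pandas as pd

def impute(df: pd.DataFrame, drop_threshold: float = 0.40) -> tuple[pd.DataFrame, list[str]]:
    """Impute per column; return the result plus columns flagged for review."""
    out = df.copy()
    flagged = []
    for col in out.columns:
        pct = out[col].isna().mean()
        if pct == 0:
            continue
        if pct > drop_threshold:
            flagged.append(col)  # consult the user before dropping
        elif pd.api.types.is_numeric_dtype(out[col]):
            if pct < 0.05:
                out[col] = out[col].fillna(out[col].median())
        else:
            out[col] = out[col].fillna("Unknown")
    return out, flagged

df = pd.DataFrame({
    "score": [1.0, None] + [2.0] * 98,   # 1% missing -> median imputation
    "city": ["NY", None] + ["LA"] * 98,  # categorical -> "Unknown"
})
clean, flagged = impute(df)
```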
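For step 3, exact duplicates map directly onto `drop_duplicates`, while whitespace/casing near-duplicates can be caught by deduplicating on a normalized key. The sketch below handles only that simple near-duplicate case; full fuzzy matching with a similarity threshold would need a string-distance library and is omitted. `dedupe` and the `name` column are hypothetical.

```python
import pandas as pd

def dedupe(df: pd.DataFrame, key: str) -> pd.DataFrame:
    """Drop exact duplicates, then rows whose `key` differs only in
    whitespace or casing; keep the first occurrence of each."""
    out = df.drop_duplicates(keep="first")
    norm = (
        out[key].str.strip().str.lower().str.replace(r"\s+", " ", regex=True)
    )
    return out.loc[~norm.duplicated(keep="first")]

df = pd.DataFrame({"name": ["Ada Lovelace", "ada  lovelace ", "Alan Turing"]})
clean = dedupe(df, "name")
```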
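Step 4's type coercion and format standardization could look like the following; the column names (`signup_date`, `amount`, `status`) are invented for the example.

```python
import pandas as pd

def standardize(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    # parse date strings; unparseable values become NaT rather than raising
    out["signup_date"] = pd.to_datetime(out["signup_date"], errors="coerce")
    # strip currency symbols and thousands separators before converting
    out["amount"] = out["amount"].str.replace(r"[$,]", "", regex=True).astype(float)
    # normalize categorical values to a canonical lowercase form
    out["status"] = out["status"].str.strip().str.lower()
    return out

df = pd.DataFrame({
    "signup_date": ["2024-01-15", "not a date"],
    "amount": ["$1,200.50", "99"],
    "status": [" Active", "ACTIVE "],
})
clean = standardize(df)
```

`errors="coerce"` keeps the pipeline moving on bad rows; the resulting NaT values then feed back into the missing-value step.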
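Step 5's three treatment modes, using the 1.5 × IQR fences, reduce to a few lines of pandas. A minimal sketch; `treat_outliers` is an assumed helper name.

```python
import pandas as pd

def treat_outliers(s: pd.Series, mode: str = "cap") -> pd.Series:
    """IQR fences at 1.5x; cap (winsorize), null out, or flag only."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    if mode == "cap":
        return s.clip(lower=lo, upper=hi)
    if mode == "null":
        return s.mask((s < lo) | (s > hi))  # NaN for later imputation
    # "flag" mode: boolean annotation, original values left untouched
    return (s < lo) | (s > hi)

s = pd.Series([10, 12, 11, 13, 12, 100])
capped = treat_outliers(s, "cap")
flags = treat_outliers(s, "flag")
```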
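The validation rules in step 6 would normally run through great_expectations; the plain-pandas sketch below shows the same idea (non-null, range, and uniqueness checks producing a violation log) without that dependency. The `validate` function and rule-dict shape are assumptions for illustration.

```python
import pandas as pd

def validate(df: pd.DataFrame, rules: dict) -> list[str]:
    """Return human-readable violations; an empty list means the data is clean."""
    violations = []
    for col in rules.get("non_null", []):
        n = int(df[col].isna().sum())
        if n:
            violations.append(f"{col}: {n} null value(s)")
    for col, (lo, hi) in rules.get("ranges", {}).items():
        bad = int((~df[col].between(lo, hi)).sum())
        if bad:
            violations.append(f"{col}: {bad} value(s) outside [{lo}, {hi}]")
    for col in rules.get("unique", []):
        d = int(df[col].duplicated().sum())
        if d:
            violations.append(f"{col}: {d} duplicate value(s)")
    return violations

df = pd.DataFrame({"id": [1, 2, 2], "age": [25, 130, 40]})
issues = validate(df, {
    "non_null": ["id"],
    "ranges": {"age": (0, 120)},
    "unique": ["id"],
})
```

Each violation line would also be appended to the cleaning log saved alongside the output dataset.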

Supported Technologies

pandas, pyjanitor, great_expectations
