data-engineering-quality
Data Quality and Testing
Data validation and testing frameworks for ensuring pipeline correctness and data quality: Great Expectations (enterprise) and Pandera (lightweight). Either can be run from orchestration tools for automated validation.
Quick Comparison
| Feature | Great Expectations | Pandera |
|---|---|---|
| Approach | Declarative "expectations" | Schema definitions with checks |
| DataFrame Support | Pandas, Spark, SQL, BigQuery | Pandas, Polars, PySpark, Dask |
| Validation Output | JSON results with detailed diagnostics | Validated DataFrame, or a raised `SchemaError` |
| Best For | Enterprise data platforms, comprehensive profiling | Python-centric pipelines, lightweight |
| Learning Curve | Steeper (concepts: DataContext, Checkpoints) | Lower (Python decorators/classes) |
| Integration | CI/CD, Airflow, Prefect, Dagster | pytest, FastAPI, any Python code |
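The "Validation Output" row is the difference that most affects calling code. The sketch below uses only the Python standard library, not the API of either framework, to contrast a validator that returns a structured, JSON-serializable result with one that raises an exception on failure. All names here (`validate_with_report`, `validate_or_raise`, `ValidationError`, the `amount` column) are hypothetical and chosen for illustration.

```python
import json

# Hypothetical sample rows; the rule under test is "amount must be >= 0".
ROWS = [{"order_id": 1, "amount": 19.99}, {"order_id": 2, "amount": -5.0}]


class ValidationError(Exception):
    """Raised by the raise-style validator when any row fails."""


def failing_rows(rows):
    """Return the indices of rows whose amount is negative."""
    return [i for i, row in enumerate(rows) if row["amount"] < 0]


def validate_with_report(rows):
    """Report style: never raises, returns a JSON-serializable result."""
    bad = failing_rows(rows)
    return {
        "success": not bad,
        "check": "amount >= 0",
        "evaluated_rows": len(rows),
        "failing_row_indices": bad,
    }


def validate_or_raise(rows):
    """Raise style: returns the rows unchanged, or raises on failure."""
    bad = failing_rows(rows)
    if bad:
        raise ValidationError(f"amount >= 0 failed for row indices {bad}")
    return rows


# Report style: the caller inspects the result and decides what to do.
report = validate_with_report(ROWS)
print(json.dumps(report, indent=2))

# Raise style: the caller handles failure as an exception.
try:
    validate_or_raise(ROWS)
except ValidationError as err:
    print(f"validation failed: {err}")
```

A report-style result tends to suit pipelines that publish or archive validation outcomes and decide on failure handling in a later step, while the raise style tends to suit fail-fast checks inside ordinary Python functions and tests.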
When to Use Which?
- Great Expectations: You need comprehensive data documentation (data docs), profiling, and validation with rich reporting. A good fit for organizations with dedicated data quality teams.