# Data Quality and Testing
Data validation and testing frameworks for ensuring pipeline correctness and data quality: Great Expectations (enterprise-oriented) and Pandera (lightweight). Both integrate with orchestration tools for automated validation.
## Quick Comparison
| Feature | Great Expectations | Pandera |
|---|---|---|
| Approach | Declarative "expectations" | Schema definitions with checks |
| DataFrame Support | Pandas, Spark, SQL, BigQuery | Pandas, Polars, PySpark, Dask |
| Validation Output | JSON results with detailed diagnostics | Boolean or exception |
| Best For | Enterprise data platforms, comprehensive profiling | Python-centric pipelines, lightweight |
| Learning Curve | Steeper (concepts: DataContext, Checkpoints) | Lower (Python decorators/classes) |
| Integration | CI/CD, Airflow, Prefect, Dagster | pytest, FastAPI, any Python code |
## When to Use Which?

- Great Expectations: You need comprehensive data documentation (data docs), profiling, and validation with rich reporting. Suits organizations with dedicated data quality teams.
- Pandera: You want lightweight, Pythonic schema validation embedded directly in your code and tests, with no extra infrastructure. Suits Python-centric pipelines.