# Data Engineering Hub
Welcome to the comprehensive data engineering skill suite. This hub organizes all data engineering knowledge into logical, non-overlapping domains.
## Skill Map
| Domain | Skill | When to Use |
|---|---|---|
| Core | @data-engineering-core | Polars, DuckDB, PyArrow fundamentals; ETL patterns; error handling; performance optimization |
| Storage | @data-engineering-storage-lakehouse | Delta Lake, Apache Iceberg, Apache Hudi |
| | @data-engineering-storage-remote-access | fsspec, pyarrow.fs, obstore; cloud access patterns |
| | @data-engineering-storage-authentication | AWS, GCP, Azure auth: IAM roles, managed identity, secrets management |
| | @data-engineering-storage-formats | Parquet optimizations, Lance, Zarr, Avro, ORC |
| Orchestration | @data-engineering-orchestration | Prefect, Dagster, dbt, workflow scheduling |
| Streaming | @data-engineering-streaming | Kafka, MQTT, NATS JetStream for real-time data |
| Quality | @data-engineering-quality | Great Expectations, Pandera for data validation |
| Observability | @data-engineering-observability | OpenTelemetry, Prometheus for pipeline monitoring |
| AI/ML | @data-engineering-ai-ml | Embeddings, vector databases, RAG pipelines |
| Best Practices | @data-engineering-best-practices | Medallion architecture, partitioning, file sizing, incremental loads, schema evolution, testing |
## More from legout/data-platform-agent-skills
- **data-science-eda**
  Exploratory Data Analysis (EDA): profiling, visualization, correlation analysis, and data quality checks. Use when understanding dataset structure, distributions, relationships, or preparing for feature engineering and modeling.
- **data-science-visualization**
  Data visualization for Python: Matplotlib, Seaborn, Plotly, Altair, hvPlot/HoloViz, and Bokeh. Use when creating exploratory charts, interactive dashboards, publication-quality figures, or choosing the right library for your data and audience.
- **data-engineering-core**
  Core Python data engineering: Polars, DuckDB, PyArrow, PostgreSQL, ETL patterns, performance tuning, and resilient pipeline construction. Use when building or reviewing batch ETL/dataframe/SQL pipelines in Python.
- **data-science-feature-engineering**
  Feature engineering for machine learning: encoding, scaling, transformations, datetime features, text features, and feature selection. Use when preparing data for modeling or improving model performance through better representations.
- **data-science-notebooks**
  Interactive notebooks for data science: Jupyter, JupyterLab, and marimo. Use for exploratory analysis, reproducible research, documentation, and sharing insights with stakeholders.
- **data-engineering-best-practices**
  Data engineering best practices: medallion architecture, dataset lifecycle, partitioning, file sizing, schema evolution, and append/overwrite/merge patterns across Polars, PyArrow, DuckDB, Delta Lake, and Iceberg. Use when designing production data pipelines or reviewing data platform decisions.