python-pipeline

Python data pipeline development

Patterns for building production-quality data processing pipelines with Python.

Targeted at Python 3.11+ for `asyncio.TaskGroup` and exception groups; Python 3.12+ for the lighter `type X = ...` alias syntax. Pin a 3.13+ runtime if you want the experimental JIT or free-threaded builds; the patterns here don't depend on either.

Choosing a DataFrame engine: pandas vs polars vs DuckDB

For a long time pandas was the default for any tabular work in Python. As of 2026 the default has shifted: polars is the right pick for multi-GB pipelines on a single machine, DuckDB is the right pick when SQL or larger-than-RAM scans are involved, and pandas stays useful for small data and the ML/notebook ecosystem (scikit-learn, statsmodels, plotnine all speak it natively).

| Tool | When | Why |
| --- | --- | --- |
| pandas | < ~1 GB data, ML interop, single-threaded familiarity | Mature, ubiquitous, eager DataFrame model. Slowest in benchmarks but the most ecosystem support. |
| polars | 1 GB to tens of GB on one box, performance-critical pipelines | Multithreaded by default, lazy query engine, Arrow-native. ~5x speedup over pandas on filter/aggregate at 100M rows. |
| DuckDB | SQL workflows, larger-than-RAM data, parquet/CSV scanning, joins across many files | Vectorized, pipelined execution; cost-based optimizer; streaming scans. Works great as a thin wrapper over a directory of parquet files. |

All three speak Apache Arrow, so zero-copy interop between them is the pragmatic answer most of the time.
