Python data pipeline development

Patterns for building production-quality data processing pipelines with Python.

Targets Python 3.11+ for asyncio.TaskGroup and exception groups, and Python 3.12+ for the lighter type X = ... alias syntax. Pin a 3.13+ runtime if you want the experimental JIT or the experimental free-threaded build; the patterns here don't depend on either.

Choosing a DataFrame engine: pandas vs polars vs DuckDB

For a long time pandas was the default for any tabular work in Python. As of 2026 the default has shifted: polars is the right pick for multi-GB pipelines on a single machine, DuckDB is the right pick when SQL or larger-than-RAM scans are involved, and pandas stays useful for small data and the ML/notebook ecosystem (scikit-learn, statsmodels, plotnine all speak it natively).

Tool	When	Why
pandas	< ~1 GB data, ML interop, teams already familiar with its API	Mature, ubiquitous, eager DataFrame model. Mostly single-threaded and usually the slowest of the three in benchmarks, but it has the broadest ecosystem support.
polars	1 GB - tens of GB on one box, performance-critical pipelines	Multithreaded by default, lazy query engine, Arrow-native. Roughly 5x faster than pandas on filter / aggregate at 100M rows, though the gap varies by workload.
DuckDB	SQL workflows, larger-than-RAM, parquet/CSV scanning, joins across many files	Vectorized + pipelined execution, cost-based optimizer, streaming scans. Works great as a thin wrapper over a directory of parquet files.

All three speak Apache Arrow, so moving data between them is cheap (often zero-copy), and mixing engines in one pipeline is the pragmatic answer most of the time.
