Community I/O-bound data processing on constrained resources Best Practices

A reference for engineers processing datasets larger than RAM on a single low-compute box. Organized by execution-lifecycle impact: rules near the top of the table govern whether the job runs at all; rules near the bottom shave off the last 10%. Optimize from the top of the waterfall.

Scope: the patterns that show up in real ETL / data-engineering / batch work on a laptop, a 2-vCPU container, or a Raspberry Pi-class node — streaming, formats, chunking, spill, backpressure, codecs, and the concurrency model that actually matches an I/O-bound bottleneck. Out of scope (covered elsewhere): the algorithmic primitives themselves (see computer-science-algorithms), distributed compute beyond a single box (use Spark/Dask), and database-engine internals (see official docs).

Distilled from Apache Arrow / Parquet docs, Polars User Guide, DuckDB docs, pandas — Scaling to large datasets, Linux man pages (mmap(2), sendfile(2), posix_fadvise(2)), Brendan Gregg's USE method and Systems Performance, Kleppmann's Designing Data-Intensive Applications, and the zstd / lz4 reference benchmarks.

When to Apply

Reach for these rules when:

A job OOM-kills, swaps, or runs much slower than expected on a small box
Input is larger than RAM and you need to scan, filter, aggregate, sort, or join it
A pipeline has unbounded buffers between stages, or memory grows linearly during a "streaming" job
You see one-row-per-RTT writes (INSERT per row, requests.get per URL, f.read(32) per record)
You're picking a format/codec/serializer and the choice matters at scale
top shows low CPU and high iowait, or you can't tell whether the job is CPU-bound or I/O-bound
"It's slow but I don't know why" — start at the obs- category
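
To make the one-row-per-RTT symptom above concrete, here is a minimal sketch of the usual remedy: batch the writes while keeping the input streamed. It assumes SQLite (via the standard-library sqlite3 module) as a stand-in for whatever store the job writes to. The table name events, the helper names batched and load_rows, and the batch size of 1000 are hypothetical choices for illustration, not recommendations from the sources listed above.

```python
import sqlite3
from itertools import islice


def batched(iterable, size):
    """Yield lists of up to `size` items without materializing the whole input."""
    it = iter(iterable)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def load_rows(conn, rows, batch_size=1000):
    """Insert rows batch by batch: one executemany and one commit per batch,
    instead of one INSERT and one commit per row."""
    # Hypothetical table, used only for this illustration.
    conn.execute("CREATE TABLE IF NOT EXISTS events (id INTEGER, payload TEXT)")
    total = 0
    for batch in batched(rows, batch_size):
        with conn:  # commits on success, rolls back if the batch fails
            conn.executemany("INSERT INTO events VALUES (?, ?)", batch)
        total += len(batch)
    return total


conn = sqlite3.connect(":memory:")
# A generator keeps peak memory at roughly one batch, not the whole dataset.
rows = ((i, f"row-{i}") for i in range(2500))
inserted = load_rows(conn, rows, batch_size=1000)
count = conn.execute("SELECT COUNT(*) FROM events").fetchone()[0]
```

An in-memory SQLite database has no network round trip, so this shows only the shape of the pattern. Against a remote database the same structure replaces thousands of round trips with a handful, and the right batch size depends on row width and available memory.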
