io-bound-data-processing
Community I/O-bound data processing on constrained resources Best Practices
A reference for engineers processing datasets larger than RAM on a single low-compute box. Organized by execution-lifecycle impact: rules near the top of the table govern whether the job runs at all; rules near the bottom shave the last 10%. Optimize from the top of the waterfall.
Scope: the patterns that show up in real ETL / data-engineering / batch work on a laptop, a 2-vCPU container, or a Raspberry Pi-class node — streaming, formats, chunking, spill, backpressure, codecs, and the concurrency model that actually matches an I/O-bound bottleneck. Out of scope (covered elsewhere): the algorithmic primitives themselves (see computer-science-algorithms), distributed compute beyond a single box (use Spark/Dask), and database-engine internals (see official docs).
Distilled from Apache Arrow / Parquet docs, Polars User Guide, DuckDB docs, pandas — Scaling to large datasets, Linux man pages (mmap(2), sendfile(2), posix_fadvise(2)), Brendan Gregg's USE method and Systems Performance, Kleppmann's Designing Data-Intensive Applications, and the zstd / lz4 reference benchmarks.
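As a rough illustration of the streaming pattern named in the scope, the sketch below aggregates a CSV one row at a time, so peak memory tracks the number of distinct keys rather than the number of rows. The function name `sum_by_key` and the `region`/`amount` columns are hypothetical, chosen only for this example; it uses the standard library alone and is not taken from any of the sources listed.

```python
import csv
import io
from collections import defaultdict


def sum_by_key(lines, key_col, value_col):
    """Sum value_col per key_col while holding only one row in memory at a time.

    `lines` is any iterable of text lines (an open file, a socket wrapper, ...).
    Memory use is proportional to the number of distinct keys, not rows.
    """
    totals = defaultdict(float)
    for row in csv.DictReader(lines):  # DictReader pulls rows lazily
        totals[row[key_col]] += float(row[value_col])
    return dict(totals)


# Tiny in-memory stand-in for a file that would not fit in RAM.
sample = io.StringIO("region,amount\neu,1.5\nus,2.0\neu,3.0\n")
result = sum_by_key(sample, "region", "amount")
print(result)
```

With a real input, passing `open(path, newline="")` in place of the `StringIO` object should behave the same way, assuming the file has a header row.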
When to Apply
Reach for these rules when:
- A job OOM-kills, swaps, or runs much slower than expected on a small box
- Input is larger than RAM and you need to scan, filter, aggregate, sort, or join it
- A pipeline has unbounded buffers between stages, or memory grows linearly during a "streaming" job
- You see one-row-per-RTT writes (`INSERT` per row, `requests.get` per URL, `f.read(32)` per record)
- You're picking a format/codec/serializer and the choice matters at scale
- A `top` shows low CPU and high iowait, or you don't know which it is
- "It's slow but I don't know why": start at the obs- category
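
The one-row-per-round-trip symptom in the list above is usually addressed by batching. The following is a minimal sketch using the standard-library `sqlite3` module, assuming a hypothetical `events` table and a batch size of 1000; the helper names `batched` and `load_rows` are made up for illustration, and the right batch size depends on the workload.

```python
import sqlite3
from itertools import islice


def batched(iterable, size):
    """Yield lists of up to `size` items without materializing the whole input."""
    it = iter(iterable)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def load_rows(conn, rows, batch_size=1000):
    """Insert rows in batches: one executemany call and one commit per batch."""
    conn.execute("CREATE TABLE IF NOT EXISTS events (id INTEGER, payload TEXT)")
    n_batches = 0
    for batch in batched(rows, batch_size):
        with conn:  # commits on success, rolls back if the batch raises
            conn.executemany("INSERT INTO events VALUES (?, ?)", batch)
        n_batches += 1
    return n_batches


conn = sqlite3.connect(":memory:")
rows = ((i, f"row-{i}") for i in range(2500))  # generator: never fully in memory
n_batches = load_rows(conn, rows, batch_size=1000)
row_count = conn.execute("SELECT COUNT(*) FROM events").fetchone()[0]
print(n_batches, row_count)
```

Only one batch is held in memory at a time, which keeps the buffer bounded even when the input iterator is effectively unbounded.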