Dask — Parallel & Distributed Computing

Overview

Dask is a Python library for parallel and distributed computing that extends familiar pandas and NumPy APIs to larger-than-memory datasets. It provides five main components (DataFrames, Arrays, Bags, Futures, and Schedulers) and runs on anything from a single multi-core machine to a multi-node HPC cluster.
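
As a rough illustration of how the collections and schedulers fit together, the sketch below builds a chunked Dask Array, records a lazy computation on it, and then runs that computation on the local threaded scheduler. The array shape and chunk size are arbitrary values chosen for the example, not recommendations.

```python
import dask.array as da

# A 4000x4000 array of ones, split into 1000x1000 chunks.
# Only the task graph is created here; no chunk is materialized yet.
x = da.ones((4000, 4000), chunks=(1000, 1000))

# Operations on the collection are recorded lazily, chunk by chunk.
total = (x + x.T).sum()

# compute() hands the graph to a scheduler. "threads" selects the local
# thread-pool scheduler; a distributed cluster could be used instead.
result = total.compute(scheduler="threads")
print(result)
```

The same pattern (build lazily, then call `compute()`) applies to DataFrames and Bags; only the collection type changes.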

When to Use

Processing datasets that exceed available RAM (10 GB–100 TB)
Parallelizing pandas or NumPy operations across multiple cores
Processing multiple files efficiently (CSV, Parquet, JSON, HDF5, Zarr)
Building custom parallel workflows with task dependencies
Distributing workloads across clusters (SLURM-managed HPC systems, Kubernetes)
Streaming/ETL pipelines for unstructured data (logs, JSON records)
For in-memory single-machine speed: use polars instead
For out-of-core single-machine analytics: use vaex instead
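
The "custom parallel workflows with task dependencies" case can be sketched with `dask.delayed`. The functions `load`, `clean`, and `combine` below are hypothetical stand-ins for real pipeline steps, and the inputs are toy values; the point is only to show how dependencies between tasks form a graph that Dask can run in parallel.

```python
from dask import delayed


@delayed
def load(n):
    # Hypothetical loader: stands in for reading one file or partition.
    return list(range(n))


@delayed
def clean(rows):
    # Hypothetical cleaning step: keep even values only.
    return [r for r in rows if r % 2 == 0]


@delayed
def combine(parts):
    # Depends on every cleaned partition; runs after all of them finish.
    return sum(len(p) for p in parts)


# Each clean(load(n)) chain is independent, so the chains can run in parallel.
parts = [clean(load(n)) for n in (4, 6, 8)]
total = combine(parts)

# Nothing has executed so far; compute() runs the whole graph.
result = total.compute(scheduler="threads")
print(result)
```

Because the graph is built before anything runs, Dask can schedule the three independent `load`/`clean` chains concurrently and only then run `combine`.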

Prerequisites