Dask — Parallel & Distributed Computing
Overview
Dask is a Python library for parallel and distributed computing that scales familiar pandas/NumPy APIs to larger-than-memory datasets. It provides five main components (DataFrames, Arrays, Bags, Futures, Schedulers) and scales from single-machine multi-core to multi-node HPC clusters.
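As a rough illustration of the NumPy-style API and the role of the scheduler, the sketch below builds a chunked Dask Array, writes an ordinary NumPy-like expression against it, and only then asks a scheduler to run it. The array size, chunk size, and the choice of the local `"threads"` scheduler are assumptions made for the example, not recommendations from this document.

```python
import dask.array as da

# A 4000 x 4000 array of ones, split into 16 chunks of 1000 x 1000.
# Nothing is computed here; Dask only records a task graph.
x = da.ones((4000, 4000), chunks=(1000, 1000))

# NumPy-style expression on the chunked array; still lazy.
col_means = (x + x.T).mean(axis=0)

# compute() hands the graph to a scheduler and returns a NumPy array.
# "threads" selects the local thread-pool scheduler (an illustrative choice).
result = col_means.compute(scheduler="threads")

print(type(result).__name__, result.shape)
```

The same expression could be run on a different scheduler without changing the array code, which is the main point of keeping the collection (Array) and the scheduler separate.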
When to Use
- Processing datasets that exceed available RAM (10 GB–100 TB)
- Parallelizing pandas or NumPy operations across multiple cores
- Processing multiple files efficiently (CSV, Parquet, JSON, HDF5, Zarr)
- Building custom parallel workflows with task dependencies
- Distributing workloads across HPC clusters (SLURM, Kubernetes)
- Streaming/ETL pipelines for unstructured data (logs, JSON records)
- Not the best fit when the data fits in memory on a single machine and raw speed is the goal: use polars instead
- Not the best fit for out-of-core analytics on a single machine: use vaex instead
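Two of the use cases listed above, custom workflows with task dependencies and ETL over JSON-like records, can be sketched with `dask.delayed` and `dask.bag`. The function names (`load`, `clean`, `combine`), the sample log lines, and the field names `level` and `ms` are hypothetical and exist only for this example.

```python
import json

import dask.bag as db
from dask import delayed


# --- Custom workflow with task dependencies (dask.delayed) ---
@delayed
def load(n):
    # Hypothetical loader: stands in for reading one file or shard.
    return list(range(n))


@delayed
def clean(rows):
    # Hypothetical transform applied to each loaded part.
    return [r * 2 for r in rows]


@delayed
def combine(parts):
    # Depends on every cleaned part, so it runs last.
    return sum(sum(p) for p in parts)


# Building the graph is lazy; the load/clean chains are independent
# and may run in parallel once compute() is called.
parts = [clean(load(n)) for n in (3, 4, 5)]
total = combine(parts).compute()

# --- ETL over semi-structured records (dask.bag) ---
raw_lines = [
    '{"level": "error", "ms": 120}',
    '{"level": "info", "ms": 15}',
    '{"level": "error", "ms": 80}',
]
records = db.from_sequence(raw_lines, npartitions=2).map(json.loads)
error_ms = (
    records.filter(lambda r: r["level"] == "error")
    .pluck("ms")
    .sum()
    .compute(scheduler="threads")  # thread scheduler chosen to keep the sketch simple
)

print(total, error_ms)
```

In a real pipeline the in-memory list would typically be replaced by a file-based reader such as `db.read_text`, and the scheduler choice would depend on whether the work is CPU-bound or I/O-bound.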