Dask — Parallel & Distributed Computing

Overview

Dask is a Python library for parallel and distributed computing that scales familiar pandas/NumPy APIs to larger-than-memory datasets. It provides five main components (DataFrames, Arrays, Bags, Futures, Schedulers) and scales from single-machine multi-core to multi-node HPC clusters.

When to Use

  • Processing datasets that exceed available RAM (10 GB–100 TB)
  • Parallelizing pandas or NumPy operations across multiple cores
  • Processing multiple files efficiently (CSV, Parquet, JSON, HDF5, Zarr)
  • Building custom parallel workflows with task dependencies
  • Distributing workloads across HPC clusters (SLURM, Kubernetes)
  • Streaming/ETL pipelines for unstructured data (logs, JSON records)
When Not to Use

  • In-memory single-machine speed: use polars instead
  • Out-of-core single-machine analytics: use vaex instead
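The custom-workflow use case above can be sketched with `dask.delayed`, which turns ordinary Python functions into lazy tasks whose call graph Dask executes in parallel. The function names here are illustrative placeholders, not part of any specific pipeline.

```python
import dask

@dask.delayed
def load(x):
    # Placeholder for an expensive per-item step (e.g. reading a file)
    return x * 2

@dask.delayed
def combine(parts):
    # Placeholder reduction over all upstream task results
    return sum(parts)

# Build the task graph lazily: four independent load() tasks feed combine()
parts = [load(i) for i in range(4)]
total = combine(parts)

# Nothing runs until compute(); Dask then schedules tasks across cores
print(total.compute())  # 12
```

The four `load` calls have no dependencies on each other, so the scheduler can run them concurrently before the single `combine` step.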
