dask-optimization
Dask - Advanced Optimization & Cluster Tuning
Parallel computing is not "free". In a distributed environment, the cost of moving data (network I/O) and scheduling tasks can often exceed the computation time. This guide focuses on minimizing overhead and maximizing throughput.
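As a rough illustration of scheduling cost, the sketch below builds the same array with two chunk sizes and counts the tasks in each graph without computing anything. The array shape and chunk sizes are arbitrary values chosen for the example, not recommendations.

```python
import dask.array as da

# Same logical array, two chunkings (sizes are illustrative assumptions).
fine = da.ones((4000, 4000), chunks=(100, 100))      # 40 x 40 blocks
coarse = da.ones((4000, 4000), chunks=(1000, 1000))  # 4 x 4 blocks

fine_chunks = fine.npartitions
coarse_chunks = coarse.npartitions

# Build (but do not run) the same computation on both arrays and
# count the tasks the scheduler would have to manage.
fine_tasks = len(dict((fine + 1).sum().__dask_graph__()))
coarse_tasks = len(dict((coarse + 1).sum().__dask_graph__()))

print(f"fine:   {fine_chunks} chunks, {fine_tasks} tasks")
print(f"coarse: {coarse_chunks} chunks, {coarse_tasks} tasks")
```

Each task carries a small fixed scheduling cost, so the finely chunked version spends proportionally more time on bookkeeping for the same arithmetic.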
When to Use
- Your Dask jobs are failing with "KilledWorker" or "OutOfMemory" errors.
- The Dask Dashboard task stream shows a lot of "red" (communication) time or blank gaps where workers sit idle.
- You need to process datasets that are 10x-100x larger than the total RAM of your cluster.
- You are building custom distributed algorithms using dask.delayed or Futures.
- You need to optimize resource allocation (CPU vs. Threads) for specific workloads.
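For the custom-algorithm case in the list above, a minimal dask.delayed sketch might look like the following. The functions load, partial_sum, and combine are hypothetical stand-ins for real work; only dask.delayed and the scheduler argument of compute come from Dask itself.

```python
from dask import delayed


@delayed
def load(i):
    # Hypothetical loader: stands in for reading one block of data.
    return list(range(i * 10, (i + 1) * 10))


@delayed
def partial_sum(block):
    # Reduce each block independently so the blocks can run in parallel.
    return sum(block)


@delayed
def combine(parts):
    # Final step: merge the small per-block results.
    return sum(parts)


blocks = [load(i) for i in range(4)]
partials = [partial_sum(b) for b in blocks]
total = combine(partials)

# Nothing has executed yet; compute() runs the graph.
result = total.compute(scheduler="threads")
print(result)
```

The threaded scheduler tends to suit code that releases the GIL (NumPy, pandas, I/O), while scheduler="processes" is usually the better fit for pure-Python loops; which one wins for a given workload is worth measuring rather than assuming.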
Reference Documentation
- Best Practices: https://docs.dask.org/en/latest/best-practices.html
- Distributed Diagnostics: https://distributed.dask.org/en/latest/diagnosing-performance.html
- Memory Management: https://distributed.dask.org/en/latest/worker.html#memory-management
- Search patterns: client.scatter, dask.compute(optimize_graph=True), repartition, client.restart
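As one example of the search patterns above, the sketch below uses client.scatter to place a shared lookup table on the workers once, then passes the resulting future to several tasks. It assumes an in-process local cluster for demonstration; the pick function and the table contents are hypothetical.

```python
import numpy as np
from dask.distributed import Client

# In-process cluster for demonstration only (assumption: a real
# deployment would connect to an existing scheduler address instead).
client = Client(
    processes=False,
    n_workers=1,
    threads_per_worker=2,
    dashboard_address=None,  # disable the dashboard for this small demo
)

lookup = np.arange(1000)

# Send the table to the workers once instead of embedding it in every task.
lookup_future = client.scatter(lookup, broadcast=True)


def pick(table, i):
    # Hypothetical task: read one entry from the shared table.
    return int(table[i])


futures = [client.submit(pick, lookup_future, i) for i in (1, 10, 100)]
results = client.gather(futures)
print(results)

client.close()
```

Passing the future rather than the raw array keeps the task graph small, since each task refers to data that already lives on the workers.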