pytorch-distributed
Overview
PyTorch Distributed enables training models across multiple GPUs and nodes. DistributedDataParallel (DDP) is the standard for multi-process data parallelism, while Fully Sharded Data Parallel (FSDP) shards model state to allow training models too large for a single GPU.
When to Use
Use DDP for general multi-GPU training on one or more nodes. Use FSDP when the model's parameters, gradients, and optimizer states together exceed the memory of a single GPU.
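The memory comparison above can be roughed out before any training code is written. The sketch below is only a back-of-the-envelope estimate: the function names are hypothetical, it assumes every tensor is fp32 (4 bytes) and that the optimizer keeps two extra values per parameter (as Adam does), and it ignores activations, buffers, and allocator overhead, so the real footprint will be larger.

```python
def training_state_bytes(
    num_params: int,
    bytes_per_value: int = 4,             # assumption: fp32 everywhere
    optimizer_states_per_param: int = 2,  # assumption: Adam-style optimizer
) -> int:
    """Lower-bound estimate of parameter + gradient + optimizer-state memory."""
    params = num_params * bytes_per_value
    grads = num_params * bytes_per_value
    optim = num_params * bytes_per_value * optimizer_states_per_param
    return params + grads + optim


def suggest_strategy(num_params: int, gpu_memory_bytes: int) -> str:
    """Return "DDP" if the estimated state fits on one GPU, else "FSDP"."""
    if training_state_bytes(num_params) < gpu_memory_bytes:
        return "DDP"
    return "FSDP"
```

For example, under these assumptions a 7-billion-parameter model needs about 112 GB for its training state alone, so `suggest_strategy(7_000_000_000, 80 * 1024**3)` points to FSDP even on an 80 GiB GPU.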
Decision Tree
- Does your model fit on one GPU?
  - YES: Use DistributedDataParallel (DDP).
  - NO: Use Fully Sharded Data Parallel (FSDP).
- Are you launching the job?
  - USE: torchrun to handle environment setup and fault recovery.
- Are you saving a checkpoint?
  - DO: Only save on rank == 0 to avoid file corruption and redundant I/O.