Overview

PyTorch Distributed enables training models across multiple GPUs and nodes. DistributedDataParallel (DDP) is the standard for multi-process data parallelism, while Fully Sharded Data Parallel (FSDP) shards model state to allow training models too large for a single GPU.

When to Use

Use DDP for general multi-GPU training on a single or multiple nodes. Use FSDP when model parameters, gradients, and optimizer states exceed the memory of a single GPU.

Decision Tree

Does your model fit on one GPU?
- YES: Use DistributedDataParallel (DDP).
- NO: Use Fully Sharded Data Parallel (FSDP).
Are you launching the job?
- USE: torchrun to handle environmental setup and fault recovery.
Are you saving a checkpoint?
- DO: Only save on rank == 0 to avoid file corruption and redundant I/O.

pytorch-distributed

Overview

When to Use

Decision Tree

Workflows