pytorch-distributed


Overview

PyTorch Distributed enables training models across multiple GPUs and nodes. DistributedDataParallel (DDP) is the standard for multi-process data parallelism, while Fully Sharded Data Parallel (FSDP) shards model state to allow training models too large for a single GPU.

When to Use

Use DDP for general multi-GPU training, on a single node or across multiple nodes, when a full replica of the model fits on each GPU. Use FSDP when the model's parameters, gradients, and optimizer states together exceed the memory of a single GPU.
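As a rough illustration of the memory rule above, the sketch below estimates the bytes needed for parameters, gradients, and optimizer state, then compares that with GPU memory. It assumes fp32 training with an Adam-style optimizer (two extra state tensors per parameter) and ignores activations; the function names and the 0.7 headroom factor are hypothetical choices for this example, not part of PyTorch.

```python
def training_state_bytes(num_params, bytes_per_param=4, optimizer_states_per_param=2):
    """Estimate bytes for parameters + gradients + optimizer state.

    Assumes fp32 values (4 bytes) and an Adam-style optimizer that keeps
    two state tensors per parameter. Activations are not counted.
    """
    params = num_params * bytes_per_param
    grads = num_params * bytes_per_param
    optim = num_params * bytes_per_param * optimizer_states_per_param
    return params + grads + optim


def suggest_strategy(num_params, gpu_memory_bytes, headroom=0.7):
    """Return "DDP" if the model state fits in a share of one GPU, else "FSDP".

    headroom is an assumed fraction of GPU memory available for model state;
    the remainder is left for activations and temporary buffers.
    """
    needed = training_state_bytes(num_params)
    return "DDP" if needed <= gpu_memory_bytes * headroom else "FSDP"


if __name__ == "__main__":
    gpu_80gib = 80 * 1024**3
    for n in (1_000_000_000, 7_000_000_000):
        print(n, training_state_bytes(n), suggest_strategy(n, gpu_80gib))
```

Under these assumptions each parameter costs 16 bytes, so a 1-billion-parameter model needs about 16 GB of state and a 7-billion-parameter model about 112 GB. Mixed precision, a different optimizer, or large activations would change the numbers, so treat the result as a first estimate only.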

Decision Tree

  1. Does your model fit on one GPU?
    • YES: Use DistributedDataParallel (DDP).
    • NO: Use Fully Sharded Data Parallel (FSDP).
  2. How are you launching the job?
    • USE: torchrun, which sets the rank and world-size environment variables for each worker process and can restart failed workers.
  3. Are you saving a checkpoint?
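    • DO: Save only on rank 0 (rank == 0) so that processes do not write the same file concurrently or repeat the same I/O. With FSDP, every rank may still need to take part in gathering the sharded state before rank 0 writes it.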
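The DDP branch of the tree above can be sketched as a small training script. This is a minimal example under stated assumptions, not a reference implementation: the toy Linear model, the random data, the checkpoint.pt filename, and the helper names dist_env and is_main_process are all hypothetical. It relies on the RANK, LOCAL_RANK, and WORLD_SIZE variables that torchrun exports, and falls back to the Gloo backend on machines without CUDA.

```python
import os


def dist_env():
    """Return (rank, local_rank, world_size) from the variables torchrun sets."""
    rank = int(os.environ.get("RANK", "0"))
    local_rank = int(os.environ.get("LOCAL_RANK", "0"))
    world_size = int(os.environ.get("WORLD_SIZE", "1"))
    return rank, local_rank, world_size


def is_main_process(rank):
    """Only the global rank 0 process should write checkpoints."""
    return rank == 0


def main():
    # Imported here so the helpers above stay usable without PyTorch installed.
    import torch
    import torch.distributed as dist
    from torch.nn.parallel import DistributedDataParallel as DDP

    rank, local_rank, _ = dist_env()
    use_cuda = torch.cuda.is_available()
    # NCCL for GPU training, Gloo as a CPU fallback.
    dist.init_process_group(backend="nccl" if use_cuda else "gloo")
    if use_cuda:
        torch.cuda.set_device(local_rank)
        device = torch.device("cuda", local_rank)
    else:
        device = torch.device("cpu")

    model = torch.nn.Linear(16, 4).to(device)  # toy stand-in for a real model
    ddp_model = DDP(model, device_ids=[local_rank] if use_cuda else None)
    optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.01)

    # One toy step on random data; DDP all-reduces gradients during backward().
    inputs = torch.randn(8, 16, device=device)
    targets = torch.randn(8, 4, device=device)
    optimizer.zero_grad()
    loss = torch.nn.functional.mse_loss(ddp_model(inputs), targets)
    loss.backward()
    optimizer.step()

    if is_main_process(rank):
        # Save the wrapped module so keys carry no "module." prefix.
        torch.save(ddp_model.module.state_dict(), "checkpoint.pt")
    dist.barrier()  # other ranks wait until the checkpoint is written
    dist.destroy_process_group()


# Run training only when launched by torchrun, which sets RANK.
if __name__ == "__main__" and "RANK" in os.environ:
    main()
```

Saved as, say, train_ddp.py (a hypothetical filename), the script could be started on one node with two workers via `torchrun --nproc_per_node=2 train_ddp.py`. A real job would also shard its dataset across ranks, for example with DistributedSampler, which this sketch leaves out.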

Workflows

First Seen
Feb 9, 2026
pytorch-distributed — cuba6112/skillfactory