Torch Pipeline Parallelism
Overview
This skill provides guidance for implementing pipeline parallelism in PyTorch for distributed model training. Pipeline parallelism partitions a model across multiple devices/ranks, where each rank processes a subset of layers and communicates activations/gradients with neighboring ranks.
Key Concepts
Pipeline Parallelism Patterns
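- AFAB (All-Forward-All-Backward): Run the forward pass for every microbatch first, caching activations, then run all the backward passes. This is the simplest schedule to implement, but each rank must hold the cached activations for every microbatch at once.
- 1F1B (One-Forward-One-Backward): After a short warm-up, alternate one forward pass with one backward pass. Each rank then caches activations for at most as many microbatches as there are pipeline stages, which lowers peak memory at the cost of more complex scheduling.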
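To make the difference between the two patterns concrete, the following is a minimal pure-Python sketch that lists the order of operations one rank would execute under each schedule and counts how many microbatch activations are cached at the peak. The function names (`afab_schedule`, `one_f_one_b_schedule`, `peak_cached`) are hypothetical helpers for illustration, not part of PyTorch, and the warm-up rule used for 1F1B (`num_stages - 1 - stage` forwards) is one common non-interleaved variant assumed here.

```python
def afab_schedule(num_microbatches):
    """All forwards, then all backwards.

    Assumption: backwards run in the same order as forwards; the
    reverse order is equally valid for this illustration.
    """
    forwards = [("F", m) for m in range(num_microbatches)]
    backwards = [("B", m) for m in range(num_microbatches)]
    return forwards + backwards


def one_f_one_b_schedule(stage, num_stages, num_microbatches):
    """Non-interleaved 1F1B order for one pipeline stage (0-indexed)."""
    # Earlier stages need more warm-up forwards before the first
    # gradient can arrive from downstream.
    warmup = min(num_stages - 1 - stage, num_microbatches)
    steady = num_microbatches - warmup

    ops = [("F", m) for m in range(warmup)]
    fwd, bwd = warmup, 0
    # Steady state: one forward followed by one backward.
    for _ in range(steady):
        ops.append(("F", fwd))
        fwd += 1
        ops.append(("B", bwd))
        bwd += 1
    # Cool-down: drain the backwards still outstanding.
    ops += [("B", m) for m in range(bwd, num_microbatches)]
    return ops


def peak_cached(ops):
    """Largest number of microbatches whose activations are held at once."""
    live = peak = 0
    for kind, _ in ops:
        live += 1 if kind == "F" else -1
        peak = max(peak, live)
    return peak


if __name__ == "__main__":
    print(one_f_one_b_schedule(stage=0, num_stages=4, num_microbatches=8))
    print(peak_cached(afab_schedule(8)))
    print(peak_cached(one_f_one_b_schedule(0, 4, 8)))
```

With 4 stages and 8 microbatches, this sketch gives a peak of 8 cached microbatches under AFAB on every stage, versus 4 on the first stage and 1 on the last stage under 1F1B.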
Critical Components
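- Model Partitioning: Divide the model's layers across ranks so that each rank owns a contiguous subset
- Activation Communication: During the forward pass, send hidden states to the next rank and receive them from the previous rank
- Gradient Communication: During the backward pass, send the gradient of the received hidden states to the previous rank and receive gradients from the next rank
- Activation Caching: Store each microbatch's activations until its backward pass has consumed them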
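As an illustration of the partitioning and neighbor bookkeeping described above, here is a minimal pure-Python sketch. Both helpers (`partition_layers`, `neighbors`) are hypothetical names introduced for this example; the near-even contiguous split is an assumption, and a real model may call for a split balanced by compute or memory instead.

```python
def partition_layers(num_layers, world_size):
    """Split layer indices into contiguous [start, end) ranges, one per rank.

    Assumption: layers cost roughly the same, so a near-even split is
    acceptable. The first `num_layers % world_size` ranks get one extra.
    """
    base, extra = divmod(num_layers, world_size)
    bounds = []
    start = 0
    for rank in range(world_size):
        size = base + (1 if rank < extra else 0)
        bounds.append((start, start + size))
        start += size
    return bounds


def neighbors(rank, world_size):
    """Return (prev_rank, next_rank); None marks a pipeline boundary.

    Forward: receive activations from prev_rank, send to next_rank.
    Backward: receive gradients from next_rank, send to prev_rank.
    """
    prev_rank = rank - 1 if rank > 0 else None
    next_rank = rank + 1 if rank < world_size - 1 else None
    return prev_rank, next_rank


if __name__ == "__main__":
    for rank, (start, end) in enumerate(partition_layers(10, 4)):
        print(rank, (start, end), neighbors(rank, 4))
```

In a PyTorch setting, a rank would typically keep only `layers[start:end]` of the full layer list and exchange tensors with the ranks returned by `neighbors`, for example with `torch.distributed.send` and `torch.distributed.recv`; the first rank has no previous neighbor and reads input data instead, while the last rank has no next neighbor and computes the loss.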