pytorch-fsdp2

Skill: Use PyTorch FSDP2 (fully_shard) correctly in a training script

This skill teaches a coding agent how to add PyTorch FSDP2 to a training loop with correct initialization, sharding, mixed precision/offload configuration, and checkpointing.

FSDP2 in PyTorch is exposed primarily via torch.distributed.fsdp.fully_shard and the FSDPModule methods it adds in-place to modules. See: references/pytorch_fully_shard_api.md, references/pytorch_fsdp2_tutorial.md.
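A minimal sketch of the usual wiring, assuming PyTorch ≥ 2.6 (where fully_shard and MixedPrecisionPolicy are exported from torch.distributed.fsdp) and a torchrun launch; build_model, model.layers, and loader are placeholders, not part of this skill:

```python
import os
import torch
import torch.distributed as dist
from torch.distributed.fsdp import fully_shard, MixedPrecisionPolicy

# Standard torchrun-style process-group setup.
dist.init_process_group("nccl")
torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

model = build_model().cuda()  # placeholder constructor

# bf16 compute with fp32 gradient reduction.
mp = MixedPrecisionPolicy(param_dtype=torch.bfloat16, reduce_dtype=torch.float32)

# Shard each block first (one all-gather per block), then wrap the root
# so leftover parameters (embeddings, final norm, head) are sharded too.
for block in model.layers:
    fully_shard(block, mp_policy=mp)
fully_shard(model, mp_policy=mp)

# model is now also an FSDPModule and its parameters are DTensors,
# so a plain optimizer over model.parameters() works unchanged.
optim = torch.optim.AdamW(model.parameters(), lr=1e-4)

for batch in loader:  # placeholder dataloader
    loss = model(batch.cuda()).sum()  # placeholder loss
    loss.backward()
    optim.step()
    optim.zero_grad()
```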


When to use this skill

Use FSDP2 when:

  • Your model doesn’t fit on one GPU (parameters + gradients + optimizer state).
  • You want eager-mode, DTensor-based per-parameter sharding, which is more inspectable and yields simpler sharded state dicts than FSDP1's flat-parameter approach.
  • You may later compose data parallelism with Tensor Parallel on a 2-D DeviceMesh (see the sketch after this list).
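For that last point, a hedged sketch of the 2-D mesh composition (TP inside each block, then FSDP across the outer mesh dimension); it assumes the process-group setup from the sketch above, and the mesh shape and the attn.wq/attn.wo plan entries are illustrative, not prescribed by this skill:

```python
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.fsdp import fully_shard
from torch.distributed.tensor.parallel import (
    ColwiseParallel,
    RowwiseParallel,
    parallelize_module,
)

# 8 GPUs as a 2-D mesh: outer "dp" dim for FSDP, inner "tp" dim for TP.
mesh = init_device_mesh("cuda", (4, 2), mesh_dim_names=("dp", "tp"))

for block in model.layers:  # model as in the sketch above
    # Apply tensor parallelism within the block first...
    parallelize_module(
        block,
        mesh["tp"],
        {"attn.wq": ColwiseParallel(), "attn.wo": RowwiseParallel()},
    )
    # ...then shard the TP'd block across the data-parallel dimension.
    fully_shard(block, mesh=mesh["dp"])
fully_shard(model, mesh=mesh["dp"])
```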

Avoid (or be careful) if:

  • You need strictly backwards-compatible checkpoints across PyTorch versions (the Distributed Checkpoint docs warn against relying on this; a DCP sketch follows this list).
  • You’re forced onto older PyTorch versions without the FSDP2 stack.
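On the checkpoint point, a hedged sketch of sharded save/resume with torch.distributed.checkpoint (DCP), assuming PyTorch ≥ 2.4 for the state_dict helpers; model and optim are from the sketch above and the checkpoint_id path is a placeholder:

```python
import torch.distributed.checkpoint as dcp
from torch.distributed.checkpoint.state_dict import get_state_dict, set_state_dict

# DTensor-aware state dicts for model and optimizer together.
model_sd, optim_sd = get_state_dict(model, optim)
dcp.save({"model": model_sd, "optim": optim_sd}, checkpoint_id="ckpt/step_100")

# On resume: load shards in place, then push them back into the objects.
model_sd, optim_sd = get_state_dict(model, optim)
dcp.load({"model": model_sd, "optim": optim_sd}, checkpoint_id="ckpt/step_100")
set_state_dict(model, optim, model_state_dict=model_sd, optim_state_dict=optim_sd)
```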