nemo-mbridge-perf-memory-tuning
Memory Tuning
Stable docs: @docs/parallelisms.md
Card: @skills/nemo-mbridge-perf-memory-tuning/card.yaml
What It Is
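GPU out-of-memory (OOM) failures during training often stem from memory fragmentation rather than a lack of raw capacity. PyTorch's default CUDA caching allocator can leave gaps between allocations that are each too small to satisfy a large request, even when the total free memory would be enough. The single most effective fix is: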
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
This tells PyTorch's CUDA allocator to use expandable segments that grow as needed, rather than fixed-size ones. Doing so substantially reduces fragmentation and often eliminates borderline OOM failures without any model or parallelism changes.
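When the variable cannot be exported in the launching shell (for example, in a notebook or a wrapper script), it can be set from Python instead. The sketch below is one possible way to do that, not part of this skill: the helper names with_expandable_segments and configure_allocator are hypothetical. It assumes the usual comma-separated option:value format of PYTORCH_CUDA_ALLOC_CONF, preserves any options already present, and should run before PyTorch initializes CUDA (most safely, before importing torch), since the allocator reads the variable at initialization.

```python
import os


def with_expandable_segments(current):
    """Return an allocator config string with expandable_segments:True set.

    `current` is the existing PYTORCH_CUDA_ALLOC_CONF value, or None.
    Other comma-separated options are kept in their original order.
    (Hypothetical helper, for illustration only.)
    """
    options = [opt.strip() for opt in (current or "").split(",") if opt.strip()]
    # Drop any earlier expandable_segments entry so the new value wins.
    options = [opt for opt in options if not opt.startswith("expandable_segments:")]
    options.append("expandable_segments:True")
    return ",".join(options)


def configure_allocator(env=os.environ):
    """Write the merged setting into `env` and return the final value.

    Call this before torch touches CUDA; setting it later has no effect
    on an allocator that is already initialized.
    """
    env["PYTORCH_CUDA_ALLOC_CONF"] = with_expandable_segments(
        env.get("PYTORCH_CUDA_ALLOC_CONF")
    )
    return env["PYTORCH_CUDA_ALLOC_CONF"]
```

Calling configure_allocator() at the very top of a training entry point, ahead of any torch import, has the same effect as the export line above while leaving settings such as max_split_size_mb untouched.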