Memory Tuning

Stable docs: @docs/parallelisms.md
Card: @skills/nemo-mbridge-perf-memory-tuning/card.yaml

What It Is

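GPU OOM failures during training often stem from memory fragmentation rather than a lack of raw capacity. PyTorch's default CUDA caching allocator can leave free gaps between live allocations that are too small or too scattered to satisfy a new request, so an allocation can fail even though enough total memory is free. The single most effective fix is: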

export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True

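This tells PyTorch's allocator to create memory segments that can grow after creation instead of fixed-size ones. Growing an existing segment avoids stranding free memory in blocks of the wrong size, which substantially reduces fragmentation and often eliminates borderline OOM without any model or parallelism changes.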
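When a launcher or job script already sets PYTORCH_CUDA_ALLOC_CONF, a plain export would overwrite the other options. The sketch below shows one possible way to merge expandable_segments:True into an existing value from Python. The helper names merge_alloc_conf and enable_expandable_segments are hypothetical (they are not part of PyTorch or this repository), and the sketch assumes the variable holds comma-separated option:value pairs.

```python
import os

ENV_VAR = "PYTORCH_CUDA_ALLOC_CONF"


def merge_alloc_conf(existing: str, key: str, value: str) -> str:
    """Return `existing` with `key:value` set, replacing any prior value for `key`.

    Hypothetical helper: assumes comma-separated `option:value` pairs.
    """
    options = []
    depth, start = 0, 0
    # Split on commas that are not inside [...] so list-valued options stay intact.
    for i, ch in enumerate(existing):
        if ch == "[":
            depth += 1
        elif ch == "]":
            depth -= 1
        elif ch == "," and depth == 0:
            options.append(existing[start:i])
            start = i + 1
    options.append(existing[start:])

    # Drop empty entries and any earlier setting of the same option.
    kept = [
        o.strip()
        for o in options
        if o.strip() and o.strip().split(":", 1)[0] != key
    ]
    kept.append(f"{key}:{value}")
    return ",".join(kept)


def enable_expandable_segments(env=os.environ) -> str:
    """Set expandable_segments:True in `env`, preserving other allocator options."""
    conf = merge_alloc_conf(env.get(ENV_VAR, ""), "expandable_segments", "True")
    env[ENV_VAR] = conf
    return conf
```

If used, enable_expandable_segments() would need to run before the process makes its first CUDA allocation, and most safely before torch is imported, since the setting is read when the allocator initializes.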

Repository

nvidia/skills

First Seen

May 29, 2026
