nemo-mbridge-perf-cuda-graphs
CUDA Graphs
Stable documentation: @docs/training/cuda-graphs.md
Card: @skills/nemo-mbridge-perf-cuda-graphs/card.yaml
What It Is
CUDA graphs capture GPU operations once and replay them with minimal host-driver overhead. Bridge supports two implementations:
| `cuda_graph_impl` | Mechanism | Scope support |
|---|---|---|
| `"local"` | MCore `FullCudaGraphWrapper` wrapping entire fwd+bwd | `full_iteration` |
| `"transformer_engine"` | TE `make_graphed_callables()` per layer | `attn`, `mlp`, `moe`, `moe_router`, `moe_preprocess`, `mamba` |
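The pairing of implementation and scope in the table can be checked before launching a run. The following is a minimal sketch, not part of Bridge: the helper name `unsupported_scopes` and the `SUPPORTED_SCOPES` mapping are hypothetical and simply restate the table, so the table and the stable documentation remain the source of truth.

```python
# Hypothetical helper: restates the scope-support table as data.
# Not a Bridge API; the names here are illustrative assumptions.
SUPPORTED_SCOPES = {
    "local": {"full_iteration"},
    "transformer_engine": {
        "attn", "mlp", "moe", "moe_router", "moe_preprocess", "mamba",
    },
}


def unsupported_scopes(cuda_graph_impl, scopes):
    """Return the requested scopes that the implementation does not list."""
    if cuda_graph_impl not in SUPPORTED_SCOPES:
        raise ValueError(f"unknown cuda_graph_impl: {cuda_graph_impl!r}")
    # Anything requested but absent from the table row is reported back.
    return sorted(set(scopes) - SUPPORTED_SCOPES[cuda_graph_impl])


if __name__ == "__main__":
    # Per-layer TE scopes are listed for "transformer_engine".
    print(unsupported_scopes("transformer_engine", ["attn", "mlp"]))
    # "attn" is not listed for "local", which only covers full_iteration.
    print(unsupported_scopes("local", ["attn"]))
```

An empty result means every requested scope appears in the table row for that implementation; a non-empty result names the scopes to drop or move to the other implementation.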
Quick Decision
Start with TE-scoped graphs for most training workloads, then verify replay timing against eager on the same dispatcher, layout, and container: