nemo-mbridge-perf-cuda-graphs


CUDA Graphs

Stable documentation: @docs/training/cuda-graphs.md
Card: @skills/nemo-mbridge-perf-cuda-graphs/card.yaml

What It Is

CUDA graphs capture GPU operations once and replay them with minimal host-driver overhead. Bridge supports two implementations:

| cuda_graph_impl | Mechanism | Scope support |
| --- | --- | --- |
| "local" | MCore FullCudaGraphWrapper wrapping entire fwd+bwd | full_iteration |
| "transformer_engine" | TE make_graphed_callables() per layer | attn, mlp, moe, moe_router, moe_preprocess, mamba |
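
As a quick sanity check before launching a run, the table above can be encoded as data and used to catch an implementation/scope mismatch. This is a minimal sketch: `SUPPORTED_SCOPES` and `check_cuda_graph_scopes` are hypothetical names introduced here for illustration, not part of Bridge, MCore, or TE, and the scope lists are copied from the table rather than read from the library.

```python
# Hypothetical helper: encodes the implementation/scope table as plain data.
# These names are illustrative only and are not a Bridge, MCore, or TE API.
SUPPORTED_SCOPES = {
    "local": {"full_iteration"},
    "transformer_engine": {
        "attn", "mlp", "moe", "moe_router", "moe_preprocess", "mamba",
    },
}


def check_cuda_graph_scopes(impl, scopes):
    """Return the requested scopes that the table does not list for `impl`."""
    if impl not in SUPPORTED_SCOPES:
        raise ValueError(f"unknown cuda_graph_impl: {impl!r}")
    # Sorted so the result is deterministic regardless of input order.
    return sorted(set(scopes) - SUPPORTED_SCOPES[impl])


# An empty list means every requested scope is listed for that implementation.
print(check_cuda_graph_scopes("transformer_engine", ["attn", "mlp"]))
# A non-empty list names the scopes that do not match the implementation.
print(check_cuda_graph_scopes("local", ["attn"]))
```

Keeping the table as data means the check needs updating only in one place if the supported scopes change in a later release.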

Quick Decision

Start with TE-scoped graphs for most training workloads, then verify replay timing against eager execution with the same dispatcher, layout, and container.
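
One way to make the eager-versus-replay comparison concrete is to collect per-step times from both runs and compare medians after discarding the first few steps, since early iterations typically include graph capture. The sketch below assumes you already have step times in milliseconds from your own logs; `replay_speedup` and its `warmup` parameter are hypothetical names, not a Bridge utility.

```python
from statistics import median


def replay_speedup(eager_ms, graphed_ms, warmup=5):
    """Median eager step time divided by median graphed step time.

    The first `warmup` steps of each run are dropped so that one-time
    costs (such as graph capture) do not skew the comparison.
    """
    eager = eager_ms[warmup:]
    graphed = graphed_ms[warmup:]
    if not eager or not graphed:
        raise ValueError("both runs need more steps than `warmup`")
    return median(eager) / median(graphed)


# Placeholder values for illustration only; these are not measurements.
eager_times = [12.0, 12.0, 10.0, 10.0, 10.0]
graphed_times = [30.0, 9.0, 8.0, 8.0, 8.0]
ratio = replay_speedup(eager_times, graphed_times, warmup=2)
print(f"speedup: {ratio:.2f}x")
```

A ratio above 1.0 means graph replay was faster than eager in this comparison. Medians are used rather than means so that a single slow step does not dominate the result.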
