nemo-mbridge-resiliency


Resiliency

Stable docs: @docs/training/resiliency.md, @docs/training/checkpointing.md
Card: @skills/nemo-mbridge-resiliency/card.yaml

Enablement

Fault tolerance (Slurm only)

Option 1: NeMo Run plugin (recommended)

from megatron.bridge.recipes.run_plugins import FaultTolerancePlugin
import nemo_run as run