nemo-mbridge-resiliency
Resiliency
Stable docs: @docs/training/resiliency.md, @docs/training/checkpointing.md
Card: @skills/nemo-mbridge-resiliency/card.yaml
Enablement
Fault tolerance (Slurm only)
Option 1: NeMo Run plugin (recommended)
from megatron.bridge.recipes.run_plugins import FaultTolerancePlugin
import nemo_run as run