Resiliency

Stable docs: docs/training/resiliency.md, docs/training/checkpointing.md
Card: card.yaml (co-located)

Enablement

Fault tolerance (Slurm only)

Option 1: NeMo Run plugin (recommended)

from megatron.bridge.recipes.run_plugins import FaultTolerancePlugin
import nemo_run as run

task = run.Script(...)
run_plugins = [
    FaultTolerancePlugin(
        enable_ft_package=True,
First Seen
Apr 19, 2026