# Multi-Node Slurm

Convert single-node `uv run python -m torch.distributed.run` commands into multi-node Slurm `sbatch` scripts with Enroot container support, and debug common multi-node failures.

## Two Approaches: srun-native vs `uv run torch.distributed`

| Approach | `ntasks-per-node` | Process spawning | Best for |
|---|---|---|---|
| srun-native (preferred) | 8 | Slurm spawns 8 tasks/node | Conversion, inference, Bridge scripts |
| `uv run torch.distributed` (legacy) | 1 | `uv run python -m torch.distributed.run` spawns 8 procs/node | MLM `pretrain_gpt.py` |
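As a concrete illustration of the srun-native row, here is a minimal `sbatch` skeleton. This is a sketch only: the node count, job name, and script name are placeholder assumptions, not values from this skill.

```shell
#!/bin/bash
# Minimal srun-native sbatch sketch (all values hypothetical).
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=8        # Slurm itself spawns 8 tasks per node
#SBATCH --gpus-per-node=8
#SBATCH --job-name=bridge-convert

# One srun, no torch.distributed.run: each of the 16 tasks becomes one rank.
srun python convert_checkpoint.py  # convert_checkpoint.py is a placeholder name
```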

Prefer srun-native: it is simpler and avoids shell-escaping issues with `TRAIN_CMD`. Megatron Bridge auto-derives `RANK`, `WORLD_SIZE`, `LOCAL_RANK`, `MASTER_ADDR`, and `MASTER_PORT` from Slurm environment variables (`SLURM_PROCID`, `SLURM_NTASKS`, `SLURM_LOCALID`, `SLURM_NODELIST`) via `common_utils.py` helpers called during distributed init in `initialize.py`, so you never need to set them manually.
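The derivation those helpers perform can be sketched in plain shell. This is an illustrative re-implementation of the mapping, not the actual `common_utils.py` code; the `MASTER_PORT` value and hostnames are arbitrary assumptions.

```shell
# Fake Slurm environment for illustration: 2 nodes x 8 tasks, this is task 3.
export SLURM_PROCID=3 SLURM_NTASKS=16 SLURM_LOCALID=3
export SLURM_NODELIST="node[0-1]"

# The same mapping Megatron Bridge applies internally (sketch):
export RANK="${SLURM_PROCID}"          # global rank
export WORLD_SIZE="${SLURM_NTASKS}"    # total ranks across all nodes
export LOCAL_RANK="${SLURM_LOCALID}"   # rank within this node
# The first hostname in the nodelist hosts the rendezvous; on a real cluster:
#   MASTER_ADDR=$(scontrol show hostnames "$SLURM_NODELIST" | head -n1)
export MASTER_ADDR="node0"             # hard-coded here, since scontrol needs Slurm
export MASTER_PORT=29500               # arbitrary free port (assumption)

echo "rank ${RANK}/${WORLD_SIZE}, local ${LOCAL_RANK}, master ${MASTER_ADDR}:${MASTER_PORT}"
# prints: rank 3/16, local 3, master node0:29500
```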

## Cluster Environment

### Container

```shell
CONTAINER_IMAGE="<PATH_TO_YOUR_CONTAINER>.sqsh"
```
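On clusters with the pyxis plugin, an Enroot squashfs image like this is typically passed straight to `srun`. A hedged sketch follows; the mount paths and script name are placeholders, not values from this skill.

```shell
# Hypothetical srun invocation using the Enroot image via the pyxis plugin.
srun --container-image="${CONTAINER_IMAGE}" \
     --container-mounts=/lustre/data:/data \
     python train.py
```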