nemo-mbridge-multi-node-slurm
Multi-Node Slurm
Convert single-node uv run python -m torch.distributed.run commands into multi-node Slurm sbatch scripts with Enroot container support, and debug common multi-node failures.
Two Approaches: srun-native vs uv run torch.distributed
| Approach | ntasks-per-node | Process spawning | Best for |
|---|---|---|---|
| srun-native (preferred) | 8 | Slurm spawns 8 tasks/node | Conversion, inference, Bridge scripts |
| uv run torch.distributed (legacy) | 1 | uv run python -m torch.distributed.run spawns 8 procs/node | MLM pretrain_gpt.py |
Prefer srun-native: it is simpler and avoids shell-escaping issues with TRAIN_CMD. Megatron Bridge derives RANK, WORLD_SIZE, LOCAL_RANK, MASTER_ADDR, and MASTER_PORT from Slurm environment variables (SLURM_PROCID, SLURM_NTASKS, SLURM_LOCALID, SLURM_NODELIST) through helpers in common_utils.py, which initialize.py calls during distributed initialization, so you never need to set them manually.
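To make the mapping above concrete, here is a minimal sketch of how such a derivation could work. It is not the actual common_utils.py code: the function names `first_host` and `derive_dist_env` are hypothetical, the fallback port 29500 is an assumption, and the nodelist parser only handles the common `prefix[a-b,c]` form (production code would more likely rely on `scontrol show hostnames`).

```python
import re


def first_host(nodelist: str) -> str:
    """Return the first hostname of a compact Slurm nodelist, e.g. 'gpu[03-06],gpu10'.

    Hypothetical helper: handles 'name', 'name1,name2' and 'prefix[a-b,c]' only;
    any suffix after the closing bracket is ignored.
    """
    # Leading entry: a prefix, optionally followed by a bracketed index list.
    m = re.match(r"([^,\[]+)(?:\[([^\]]+)\])?", nodelist)
    if m is None:
        raise ValueError(f"cannot parse nodelist: {nodelist!r}")
    prefix, ranges = m.group(1), m.group(2)
    if ranges is None:
        return prefix
    # First index of the first range, preserving zero padding ('03' stays '03').
    first_index = ranges.split(",")[0].split("-")[0]
    return prefix + first_index


def derive_dist_env(env: dict) -> dict:
    """Map Slurm task variables to the variables torch.distributed reads.

    Hypothetical helper; the default port is an assumption, not Bridge behavior.
    """
    return {
        "RANK": env["SLURM_PROCID"],        # global rank of this task
        "WORLD_SIZE": env["SLURM_NTASKS"],  # total tasks across all nodes
        "LOCAL_RANK": env["SLURM_LOCALID"], # task index within its node
        "MASTER_ADDR": first_host(env["SLURM_NODELIST"]),
        "MASTER_PORT": env.get("MASTER_PORT", "29500"),
    }


if __name__ == "__main__":
    # Example: task 9 of 16 on a two-node job with 8 tasks per node.
    sample = {
        "SLURM_PROCID": "9",
        "SLURM_NTASKS": "16",
        "SLURM_LOCALID": "1",
        "SLURM_NODELIST": "gpu[03-04]",
    }
    for key, value in derive_dist_env(sample).items():
        print(f"{key}={value}")
```

With 8 tasks per node, every task sees its own SLURM_PROCID and SLURM_LOCALID, which is why the srun-native approach needs no launcher to assign ranks.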