multi-node-slurm
Multi-Node Slurm
Convert single-node `uv run python -m torch.distributed.run` commands into multi-node Slurm `sbatch` scripts with Enroot container support, and debug common multi-node failures.
Two Approaches: srun-native vs uv run torch.distributed
| Approach | ntasks-per-node | Process spawning | Best for |
|---|---|---|---|
| srun-native (preferred) | 8 | Slurm spawns 8 tasks/node | Conversion, inference, Bridge scripts |
| uv run torch.distributed (legacy) | 1 | `uv run python -m torch.distributed.run` spawns 8 procs/node | MLM pretrain_gpt.py |
Prefer srun-native: it is simpler and avoids shell-escaping issues with TRAIN_CMD. Megatron Bridge auto-derives RANK, WORLD_SIZE, LOCAL_RANK, MASTER_ADDR, and MASTER_PORT from Slurm environment variables (SLURM_PROCID, SLURM_NTASKS, SLURM_LOCALID, SLURM_NODELIST) via common_utils.py helpers called during the initialize.py distributed init, so you never need to set them manually.
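The derivation described above can be sketched as follows. This is a hypothetical illustration, not the actual common_utils.py code: the real helpers also resolve MASTER_ADDR from compressed node lists (e.g. `node[001-002]`) via `scontrol show hostnames`, which is stubbed out here with a simple comma split.

```python
def derive_torch_dist_env(slurm_env):
    """Map Slurm-provided variables to the env vars torch.distributed expects.

    Minimal sketch of what Megatron Bridge's helpers do; hypothetical, for
    illustration only. Assumes SLURM_NODELIST is an uncompressed,
    comma-separated host list.
    """
    return {
        "RANK": slurm_env["SLURM_PROCID"],         # global rank of this task
        "WORLD_SIZE": slurm_env["SLURM_NTASKS"],   # total tasks across all nodes
        "LOCAL_RANK": slurm_env["SLURM_LOCALID"],  # rank within this node
        # First host in the node list serves as the rendezvous master.
        "MASTER_ADDR": slurm_env["SLURM_NODELIST"].split(",")[0],
        "MASTER_PORT": slurm_env.get("MASTER_PORT", "29500"),
    }

# Example: local rank 1 on the second node of a 2-node x 8-GPU job.
env = derive_torch_dist_env({
    "SLURM_PROCID": "9",
    "SLURM_NTASKS": "16",
    "SLURM_LOCALID": "1",
    "SLURM_NODELIST": "node001,node002",
})
print(env["RANK"], env["WORLD_SIZE"], env["MASTER_ADDR"])  # → 9 16 node001
```

This is why ntasks-per-node=8 in the srun-native approach: each Slurm task already carries a unique SLURM_PROCID/SLURM_LOCALID, so no launcher wrapper is needed.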
Cluster Environment
Container
```bash
CONTAINER_IMAGE="<PATH_TO_YOUR_CONTAINER>.sqsh"
```
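To tie the pieces together, a minimal srun-native sbatch skeleton with Enroot (pyxis) container support might look like the following. Treat it as a sketch: the node/GPU counts, time limit, mount paths, and script name are placeholder assumptions, not verified cluster defaults.

```bash
#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=8   # srun-native: Slurm spawns one task per GPU
#SBATCH --gpus-per-node=8
#SBATCH --time=01:00:00

CONTAINER_IMAGE="<PATH_TO_YOUR_CONTAINER>.sqsh"

# pyxis launches every task inside the Enroot container. RANK, WORLD_SIZE,
# LOCAL_RANK, MASTER_ADDR, and MASTER_PORT are derived from SLURM_* variables
# by Megatron Bridge, so no torch.distributed.run wrapper is needed.
srun --container-image="${CONTAINER_IMAGE}" \
     --container-mounts=/path/to/workspace:/workspace \
     python /workspace/your_script.py
```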