multi-node-slurm
Multi-Node Slurm
Convert single-node `uv run python -m torch.distributed.run` commands into multi-node Slurm `sbatch` scripts with Enroot container support, and debug common multi-node failures.
Two Approaches: srun-native vs uv run torch.distributed
| Approach | ntasks-per-node | Process spawning | Best for |
|---|---|---|---|
| srun-native (preferred) | 8 | Slurm spawns 8 tasks/node | Conversion, inference, Bridge scripts |
| uv run torch.distributed (legacy) | 1 | `uv run python -m torch.distributed.run` spawns 8 procs/node | MLM pretrain_gpt.py |
Prefer srun-native: it is simpler and avoids shell-escaping issues with TRAIN_CMD. Megatron Bridge auto-derives RANK, WORLD_SIZE, LOCAL_RANK, MASTER_ADDR, and MASTER_PORT from Slurm environment variables (SLURM_PROCID, SLURM_NTASKS, SLURM_LOCALID, SLURM_NODELIST) via common_utils.py helpers called during the initialize.py distributed init, so you never need to set them manually.
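The derivation described above can be sketched as follows. This is a hypothetical illustration, not the actual common_utils.py code: the real helpers also resolve MASTER_ADDR from compressed node lists (e.g. `node[001-002]`) via `scontrol show hostnames`, which is stubbed out here with a simple comma split.

```python
def derive_torch_dist_env(slurm_env):
    """Map Slurm-provided variables to the env vars torch.distributed expects.

    Minimal sketch of what Megatron Bridge's helpers do; hypothetical, for
    illustration only. Assumes SLURM_NODELIST is an uncompressed,
    comma-separated host list.
    """
    return {
        "RANK": slurm_env["SLURM_PROCID"],         # global rank of this task
        "WORLD_SIZE": slurm_env["SLURM_NTASKS"],   # total tasks across all nodes
        "LOCAL_RANK": slurm_env["SLURM_LOCALID"],  # rank within this node
        # First host in the node list serves as the rendezvous master.
        "MASTER_ADDR": slurm_env["SLURM_NODELIST"].split(",")[0],
        "MASTER_PORT": slurm_env.get("MASTER_PORT", "29500"),
    }

# Example: local rank 1 on the second node of a 2-node x 8-GPU job.
env = derive_torch_dist_env({
    "SLURM_PROCID": "9",
    "SLURM_NTASKS": "16",
    "SLURM_LOCALID": "1",
    "SLURM_NODELIST": "node001,node002",
})
print(env["RANK"], env["WORLD_SIZE"], env["MASTER_ADDR"])  # → 9 16 node001
```

This is why ntasks-per-node=8 in the srun-native approach: each Slurm task already carries a unique SLURM_PROCID/SLURM_LOCALID, so no launcher wrapper is needed.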
Cluster Environment
Container
```bash
CONTAINER_IMAGE="<PATH_TO_YOUR_CONTAINER>.sqsh"
```
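To tie the pieces together, a minimal srun-native sbatch skeleton with Enroot (pyxis) container support might look like the following. Treat it as a sketch: the node/GPU counts, time limit, mount paths, and script name are placeholder assumptions, not verified cluster defaults.

```bash
#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=8   # srun-native: Slurm spawns one task per GPU
#SBATCH --gpus-per-node=8
#SBATCH --time=01:00:00

CONTAINER_IMAGE="<PATH_TO_YOUR_CONTAINER>.sqsh"

# pyxis launches every task inside the Enroot container. RANK, WORLD_SIZE,
# LOCAL_RANK, MASTER_ADDR, and MASTER_PORT are derived from SLURM_* variables
# by Megatron Bridge, so no torch.distributed.run wrapper is needed.
srun --container-image="${CONTAINER_IMAGE}" \
     --container-mounts=/path/to/workspace:/workspace \
     python /workspace/your_script.py
```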