mlm-bridge-training

MLM vs Bridge Training

For the differences between MLM and Bridge training, the argument-mapping tables, known gotchas, and the translation script, see:

docs/megatron-lm-to-megatron-bridge.md

Correlation Testing

Use vanilla_gpt_pretrain_config for loss-correlation testing. This recipe uses the bare GPTModelProvider defaults (LayerNorm, GeLU, learned_absolute position embeddings, and vocab_size inherited from the tokenizer), which match the MLM pretrain_gpt.py defaults with no args.
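Once both runs have produced per-step losses, the two curves can be compared numerically. The sketch below is one minimal way to do that, assuming the losses have already been extracted from each run's logs into plain Python lists. The helper names (pearson, max_abs_diff) and the sample values are hypothetical and are not part of Megatron Bridge or Megatron-LM.

```python
import math
from typing import Sequence


def pearson(a: Sequence[float], b: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length series."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two series of equal length >= 2")
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        raise ValueError("a constant series has undefined correlation")
    return cov / math.sqrt(var_a * var_b)


def max_abs_diff(a: Sequence[float], b: Sequence[float]) -> float:
    """Largest per-step absolute gap between the two loss curves."""
    if len(a) != len(b):
        raise ValueError("series must have equal length")
    return max(abs(x - y) for x, y in zip(a, b))


# Made-up illustrative values, NOT real training output.
mlm_losses = [10.8, 9.9, 9.1, 8.6, 8.2]
bridge_losses = [10.8, 9.9, 9.2, 8.6, 8.2]

r = pearson(mlm_losses, bridge_losses)
gap = max_abs_diff(mlm_losses, bridge_losses)
print(f"pearson r = {r:.4f}, max |diff| = {gap:.4f}")
```

A high correlation alone can hide a constant offset between the curves, so it is worth checking the per-step gap as well. Any pass/fail threshold is a project decision that this sketch does not prescribe.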

MLM Correlation Run (2L/256H, 1 GPU)

PYTHONPATH=3rdparty/Megatron-LM:$PYTHONPATH \
uv run python -m torch.distributed.run --nproc_per_node=1 \
  3rdparty/Megatron-LM/pretrain_gpt.py \

Related skills

More from nvidia-nemo/megatron-bridge

multi-node-slurm
Convert single-node scripts to multi-node Slurm sbatch jobs and debug common multi-node failures. Covers srun-native vs uv run torch.distributed approaches, container setup, NCCL timeouts, OOM sizing for MoE models, and interactive allocation. Use when creating Slurm scripts, scaling to multi-node, or debugging multi-node job failures.

developer-guide
Developer environment setup, CI/CD workflows, and CI failure debugging for Megatron Bridge. Covers container-based development, uv package management, pre-commit hooks, running tests, CI failure investigation, and common pitfalls. Use when onboarding, setting up a dev environment, troubleshooting build issues, investigating CI failures, or dealing with lockfile issues (corrupted, regenerating, or updating uv.lock).

parity-testing
Structured framework for verifying numerical parity of HF<->MCore weight conversions. References existing tools and the add-model-support skill. Use when debugging weight mismatches, verifying checkpoint round-trips, or choosing which verification tool to run.

code-style
Code style and quality guidelines for Megatron Bridge. Covers naming, type hints, ruff enforcement, keyword-arg safety, copyright headers, logging, and common anti-patterns. Auto-invoked during code review and when writing new code.

resiliency
Resiliency features in Megatron Bridge including fault tolerance, straggler detection, in-process restart, preemption, and re-run state machine. Use when the user asks about fault tolerance, straggler detection, hang detection, automatic restart, preemption, in-process restart, checkpoint recovery, or nvidia-resiliency-ext.

adding-model-support
Guide for adding support for new LLM or VLM models in Megatron-Bridge. Covers bridge, provider, recipe, tests, docs, and examples. Use when the user asks to add, support, onboard, or integrate a new model, or when creating bridges, providers, or recipes for a new model family.

Repository: nvidia-nemo/meg…n-bridge

GitHub Stars: 577

First Seen: Apr 19, 2026

Security Audits: Gen Agent Trust Hub (Pass), Socket (Pass), Snyk (Pass)
