parity-testing
Parity Testing for Megatron Bridge
This skill provides the decision framework for choosing the right
verification tool and interpreting results. For the full model onboarding
workflow (which includes parity testing as milestones 1 and 2), see the
add-model-support skill.
Quick Decision: Which Tool to Run
| What you want to verify | Tool | GPU? | When to use |
|---|---|---|---|
| All weights round-trip exactly (single GPU) | hf_megatron_roundtrip.py |
No | First check after writing a bridge |
| Weights round-trip with TP/PP/EP | hf_megatron_roundtrip_multi_gpu.py |
Yes | After single-GPU passes |
| Forward-pass logit equivalence | compare_hf_and_megatron/compare.py |
Yes | After round-trip passes |
| Text generation sanity | hf_to_megatron_generate_text.py |
Yes | Large models that OOM compare.py |
| Programmatic weight check | weights_verification_table() |
Yes | Inside Python scripts |
| VLM generation sanity | hf_to_megatron_generate_vlm.py |
Yes | VLM models |
All tools live under examples/conversion/.
More from nvidia-nemo/megatron-bridge
multi-node-slurm
Convert single-node scripts to multi-node Slurm sbatch jobs and debug common multi-node failures. Covers srun-native vs uv run torch.distributed approaches, container setup, NCCL timeouts, OOM sizing for MoE models, and interactive allocation. Use when creating Slurm scripts, scaling to multi-node, or debugging multi-node job failures.
1developer-guide
Developer environment setup, CI/CD workflows, and CI failure debugging for Megatron Bridge. Covers container-based development, uv package management, pre-commit hooks, running tests, CI failure investigation, and common pitfalls. Use when onboarding, setting up a dev environment, troubleshooting build issues, investigating CI failures, or dealing with lockfile issues (corrupted, regenerating, or updating uv.lock).
1code-style
Code style and quality guidelines for Megatron Bridge. Covers naming, type hints, ruff enforcement, keyword-arg safety, copyright headers, logging, and common anti-patterns. Auto-invoked during code review and when writing new code.
1resiliency
Resiliency features in Megatron Bridge including fault tolerance, straggler detection, in-process restart, preemption, and re-run state machine. Use when the user asks about fault tolerance, straggler detection, hang detection, automatic restart, preemption, in-process restart, checkpoint recovery, or nvidia-resiliency-ext.
1adding-model-support
Guide for adding support for new LLM or VLM models in Megatron-Bridge. Covers bridge, provider, recipe, tests, docs, and examples. Use when the user asks to add, support, onboard, or integrate a new model, or when creating bridges, providers, or recipes for a new model family.
1mlm-bridge-training
Run Megatron-LM (MLM) and Megatron Bridge training with mock or real data. Covers correlation testing, available recipes, and multi-GPU examples. Use when running training, comparing MLM vs Bridge, or translating configs.
1