parity-testing

Parity Testing for Megatron Bridge

This skill provides the decision framework for choosing the right verification tool and interpreting its results. For the full model onboarding workflow (which includes parity testing as milestones 1 and 2), see the adding-model-support skill.

Quick Decision: Which Tool to Run

| What you want to verify | Tool | GPU? | When to use |
| --- | --- | --- | --- |
| All weights round-trip exactly (single GPU) | `hf_megatron_roundtrip.py` | No | First check after writing a bridge |
| Weights round-trip with TP/PP/EP | `hf_megatron_roundtrip_multi_gpu.py` | Yes | After single-GPU passes |
| Forward-pass logit equivalence | `compare_hf_and_megatron/compare.py` | Yes | After round-trip passes |
| Text generation sanity | `hf_to_megatron_generate_text.py` | Yes | Large models that OOM compare.py |
| Programmatic weight check | `weights_verification_table()` | Yes | Inside Python scripts |
| VLM generation sanity | `hf_to_megatron_generate_vlm.py` | Yes | VLM models |

All tools live under examples/conversion/.
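The "When to use" column implies an order: single-GPU round-trip first, then multi-GPU round-trip, then logit comparison, with text generation as the fallback when `compare.py` runs out of memory. A minimal sketch of that ordering follows. The names `Check`, `ORDERED_CHECKS`, and `next_check` are hypothetical helpers written for illustration and are not part of Megatron Bridge; only the script names come from the table.

```python
from dataclasses import dataclass
from typing import Optional, Set


@dataclass(frozen=True)
class Check:
    """One row of the decision table (hypothetical helper type)."""
    goal: str
    tool: str
    needs_gpu: bool


# Rows taken from the table above, in the order the "When to use" column implies.
SINGLE = Check("weights round-trip (single GPU)", "hf_megatron_roundtrip.py", False)
MULTI = Check("weights round-trip with TP/PP/EP", "hf_megatron_roundtrip_multi_gpu.py", True)
COMPARE = Check("forward-pass logit equivalence", "compare_hf_and_megatron/compare.py", True)
# Fallback listed in the table for large models that OOM compare.py.
GENERATE_TEXT = Check("text generation sanity", "hf_to_megatron_generate_text.py", True)

ORDERED_CHECKS = [SINGLE, MULTI, COMPARE]


def next_check(passed: Set[str], compare_ooms: bool = False) -> Optional[Check]:
    """Return the next check to run, given the tool names that already passed.

    If the next step would be compare.py and the model is known to OOM there,
    suggest the text-generation sanity check instead.
    """
    for check in ORDERED_CHECKS:
        if check.tool in passed:
            continue
        if check is COMPARE and compare_ooms:
            return GENERATE_TEXT
        return check
    return None  # every ordered check has passed


if __name__ == "__main__":
    step = next_check(set())
    print(step.tool, "(GPU required)" if step.needs_gpu else "(no GPU required)")
```

The sketch encodes only the ordering stated in the table. It does not launch any script, and VLM models would use `hf_to_megatron_generate_vlm.py` for the generation step.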

Repository

nvidia-nemo/meg…n-bridge

First Seen

Apr 19, 2026