Build & Dependency Guide
The core principle: build and develop inside containers. The CI container ships the correct CUDA toolkit, PyTorch build, and pre-compiled native extensions (TransformerEngine, DeepEP, …) that cannot be reliably reproduced on a bare host.
Why Containers
Megatron-LM depends on CUDA, NCCL, PyTorch with GPU support, TransformerEngine, and optional components like ModelOpt and DeepEP. Installing these on a bare host is fragile and hard to reproduce. The project ships Dockerfiles that pin every dependency.
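To illustrate the pinning approach, a Dockerfile can fix the base image and every library by exact tag and version. This is a hypothetical sketch only: the image tag, package names, and versions below are assumptions for illustration, not the project's actual pins.

```dockerfile
# Hypothetical sketch: pin the CUDA/PyTorch base image by exact tag,
# so every developer and CI job gets identical CUDA/NCCL/cuDNN builds.
FROM nvcr.io/nvidia/pytorch:24.07-py3

# Pin native extensions to exact versions (versions here are illustrative).
RUN pip install --no-cache-dir \
    transformer-engine==1.9.0 \
    nvidia-modelopt==0.15.0
```

Pinning by exact tag (rather than `latest`) is what makes the environment reproducible months later.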
Use the container as your development environment. This guarantees:
- Identical CUDA / NCCL / cuDNN versions across all developers and CI.
- `uv.lock` resolves the same way locally and in CI.
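A containerized workflow along these lines might look like the following sketch. The image name, Dockerfile path, and mount paths are assumptions, not the project's actual values; treat this as a command-line illustration of the pattern, not a verified recipe.

```shell
# Build the dev image from the repo's Dockerfile (path is an assumption)
docker build -t megatron-dev -f docker/Dockerfile .

# Run with GPU access, mounting the source tree so edits on the host
# are immediately visible inside the container
docker run --gpus all -it --rm \
    -v "$(pwd)":/workspace/megatron-lm \
    -w /workspace/megatron-lm \
    megatron-dev bash
```

Mounting the source tree keeps the edit-build-test loop on the host filesystem while all toolchain dependencies stay inside the pinned image.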