Build & Dependency Guide
The core principle: build and develop inside containers. The CI container ships the correct CUDA toolkit, PyTorch build, and pre-compiled native extensions (TransformerEngine, DeepEP, …) that cannot be reliably reproduced on a bare host.
Why Containers
Megatron-LM depends on CUDA, NCCL, PyTorch with GPU support, TransformerEngine, and optional components like ModelOpt and DeepEP. Installing these on a bare host is fragile and hard to reproduce. The project ships Dockerfiles that pin every dependency.
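To illustrate the pinning approach, a Dockerfile can fix the base image and every library by exact tag and version. This is a hypothetical sketch only: the image tag, package names, and versions below are assumptions for illustration, not the project's actual pins.

```dockerfile
# Hypothetical sketch: pin the CUDA/PyTorch base image by exact tag,
# so every developer and CI job gets identical CUDA/NCCL/cuDNN builds.
FROM nvcr.io/nvidia/pytorch:24.07-py3

# Pin native extensions to exact versions (versions here are illustrative).
RUN pip install --no-cache-dir \
    transformer-engine==1.9.0 \
    nvidia-modelopt==0.15.0
```

Pinning by exact tag (rather than `latest`) is what makes the environment reproducible months later.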
Use the container as your development environment. This guarantees:
- Identical CUDA / NCCL / cuDNN versions across all developers and CI.
- `uv.lock` resolves the same way locally and in CI.
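A containerized workflow along these lines might look like the following sketch. The image name, Dockerfile path, and mount paths are assumptions, not the project's actual values; treat this as a command-line illustration of the pattern, not a verified recipe.

```shell
# Build the dev image from the repo's Dockerfile (path is an assumption)
docker build -t megatron-dev -f docker/Dockerfile .

# Run with GPU access, mounting the source tree so edits on the host
# are immediately visible inside the container
docker run --gpus all -it --rm \
    -v "$(pwd)":/workspace/megatron-lm \
    -w /workspace/megatron-lm \
    megatron-dev bash
```

Mounting the source tree keeps the edit-build-test loop on the host filesystem while all toolchain dependencies stay inside the pinned image.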