testsystem
Installation
SKILL.md
Test System & CI Guide
Test Layout
tests/
├── unit_tests/ # pytest, 1 node × 8 GPUs, torch.distributed runner
├── functional_tests/ # end-to-end shell + training scripts
│ └── test_cases/
│ └── {model}/{test_case}/
│ ├── model_config.yaml # training args
│ └── golden_values_{env}_{platform}.json
└── test_utils/
├── recipes/
│ ├── h100/ # YAML recipes for H100 jobs
│ └── gb200/ # YAML recipes for GB200 jobs
└── python_scripts/ # helpers (recipe_parser, golden-value download, …)
Related skills
More from nvidia/megatron-lm
split-pr
Split a PR into multiple PRs to reduce the number of required CODEOWNERS reviewer groups.
2respond-to-issue
Research and draft a response to a GitHub issue or question from an external contributor.
2create-issue
Investigate a failing GitHub Actions run or job and create a GitHub issue for the failure.
2build-and-dependency
Container-based dev environment setup and dependency management for Megatron-LM. Covers acquiring and launching the CI container, uv package management, and updating uv.lock.
2onboard-gb200-1node-tests
Onboard 1-node GitHub MR functional tests for GB200 from existing mr-scoped 2-node tests.
2cicd
CI/CD reference for Megatron-LM. Covers CI pipeline structure, PR scope labels, triggering internal GitLab CI, and CI failure investigation.
1