Run Megatron-LM on SLURM
Prerequisites
- A SLURM cluster login with submission rights to a GPU partition.
- Megatron-LM checked out on a filesystem visible to all nodes in the allocation (NFS, Lustre, or similar). All nodes must reach the same paths for code, data, checkpoints, and output.
- uv installed; run uv sync --extra training --extra dev (or --extra lts) on the worktree once before submission so the .venv is materialized and visible to every node.
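The prerequisites above can be checked from the login node before submitting. The following is a minimal sketch, not part of Megatron-LM: the helper names (have_cmd, venv_ready, preflight) are hypothetical, and it assumes uv places the interpreter at .venv/bin/python inside the worktree. It does not verify that the path is on a shared filesystem, which still has to be confirmed for the cluster in question.

```shell
# Preflight sketch: source this file, then call preflight from the worktree.
# All function names here are hypothetical helpers for illustration.

# Succeeds if the given command is on PATH.
have_cmd() {
    command -v "$1" >/dev/null 2>&1
}

# Succeeds if <worktree>/.venv contains an executable Python interpreter.
venv_ready() {
    [ -x "$1/.venv/bin/python" ]
}

# Prints one line per problem found; returns non-zero if any check fails.
preflight() {
    local worktree="$1"
    local status=0
    if ! have_cmd uv; then
        echo "uv not found on PATH"
        status=1
    fi
    if ! venv_ready "$worktree"; then
        echo "no .venv in $worktree; run: uv sync --extra training --extra dev"
        status=1
    fi
    return "$status"
}
```

A typical use would be preflight "$PWD" && sbatch run_megatron.slurm, so that the job is only queued when both checks pass.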
Minimal sbatch script
Save as run_megatron.slurm in the worktree:
#!/bin/bash
#SBATCH --job-name=megatron
#SBATCH --account=<SLURM_ACCOUNT>
#SBATCH --partition=<SLURM_PARTITION>
#SBATCH --nodes=<NODES>
#SBATCH --ntasks-per-node=1
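The directives above only request resources: with one task per node, a common pattern is for each task to start a per-node launcher that spawns one worker per GPU. The sketch below illustrates that pattern under several assumptions that are not confirmed by this document: the launcher is PyTorch's torch.distributed.run module, the entry point is pretrain_gpt.py in the submission directory, each node has 8 GPUs unless GPUS_PER_NODE is set, and port 29500 is free on the first node. GPUS_PER_NODE, MASTER_PORT, and the launch function are hypothetical names; training arguments are passed through untouched.

```shell
# Launch-step sketch, intended to sit beneath the #SBATCH directives.
# Assumptions (not from the source): torch.distributed.run as launcher,
# pretrain_gpt.py as entry point, 8 GPUs per node, port 29500 free.

launch() {
    # First hostname in the allocation serves as the rendezvous host.
    master_addr=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)

    # One srun task per node; each task spawns one worker per GPU.
    srun "$SLURM_SUBMIT_DIR/.venv/bin/python" -m torch.distributed.run \
        --nnodes="$SLURM_NNODES" \
        --nproc_per_node="${GPUS_PER_NODE:-8}" \
        --rdzv_id="$SLURM_JOB_ID" \
        --rdzv_backend=c10d \
        --rdzv_endpoint="$master_addr:${MASTER_PORT:-29500}" \
        "$SLURM_SUBMIT_DIR/pretrain_gpt.py" "$@"
}

# Only launch when running inside a SLURM allocation.
if [ -n "${SLURM_JOB_ID:-}" ]; then
    launch "$@"
fi
```

Calling the interpreter from .venv directly, rather than re-resolving the environment on every node, relies on the .venv having been materialized on the shared filesystem before submission. Arguments given after the script name on the sbatch command line reach the training script through "$@".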