nemo-mbridge-perf-moe-long-context


MoE Long-Context Training

Stable docs: @docs/training/moe-optimization.md
Card: @skills/nemo-mbridge-perf-moe-long-context/card.yaml

What Changes At Long Context

Once sequence length moves well past the 4K-class regime, attention memory and activation residency become the dominant constraints. For MoE models, that usually means you need some combination of:

  • context parallelism
  • selective recompute
  • lower precision
  • CPU offload for optimizer state
  • a dispatcher and PP layout that do not waste the smaller remaining DP budget
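The "smaller remaining DP budget" in the last bullet can be made concrete with a little arithmetic. The sketch below assumes the usual Megatron-style decomposition, in which the data-parallel size is the world size divided by the product of the tensor-, pipeline-, and context-parallel sizes. The class `ParallelLayout` and its field names are hypothetical, chosen for illustration only, and the GPU counts are made-up examples rather than a recommended configuration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ParallelLayout:
    """Hypothetical helper: not part of any library, illustration only."""
    world_size: int  # total number of GPUs
    tp: int = 1      # tensor-parallel size
    pp: int = 1      # pipeline-parallel size
    cp: int = 1      # context-parallel size

    def dp(self) -> int:
        """Data-parallel size left over after TP, PP, and CP are carved out."""
        model_parallel = self.tp * self.pp * self.cp
        if self.world_size % model_parallel != 0:
            raise ValueError("world_size must be divisible by tp * pp * cp")
        return self.world_size // model_parallel

    def tokens_per_rank(self, seq_len: int) -> int:
        """Sequence tokens each CP rank holds (simplified: ignores sequence parallelism)."""
        if seq_len % self.cp != 0:
            raise ValueError("seq_len must be divisible by cp")
        return seq_len // self.cp


# Same 64 GPUs, same TP and PP; only the context-parallel size changes.
short_ctx = ParallelLayout(world_size=64, tp=2, pp=4, cp=1)
long_ctx = ParallelLayout(world_size=64, tp=2, pp=4, cp=4)

print(short_ctx.dp(), short_ctx.tokens_per_rank(4096))   # 8 4096
print(long_ctx.dp(), long_ctx.tokens_per_rank(65536))    # 2 16384
```

In this example, raising CP from 1 to 4 cuts the DP size from 8 to 2, which is why the dispatcher and PP layout matter more at long context: there are fewer data-parallel replicas left to absorb any inefficiency.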

Rounded Scaling Patterns

Repository: nvidia/skills