MoE Long-Context Training

Stable docs: @docs/training/moe-optimization.md
Card: @skills/nemo-mbridge-perf-moe-long-context/card.yaml

What Changes At Long Context

Once sequence length moves well past the 4K-class regime, attention memory and activation residency become the dominant constraints. For MoE models, that usually means you need some combination of:

context parallelism
selective recompute
lower precision
CPU offload for optimizer state
a dispatcher and PP layout that do not waste the smaller remaining DP budget
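
To make the "smaller remaining DP budget" concrete, here is a minimal sketch of the rank arithmetic. It assumes the usual decomposition in which data-parallel size is the world size divided by the product of the tensor, pipeline, and context parallel sizes. `ParallelLayout` and its field names are hypothetical helpers for illustration, not part of any library API, and the sketch only checks that the sequence length divides evenly by the CP size (real frameworks may impose stricter divisibility rules).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ParallelLayout:
    """Hypothetical helper: splits a fixed GPU count across parallel dimensions."""

    world_size: int
    tp: int = 1  # tensor parallel size
    pp: int = 1  # pipeline parallel size
    cp: int = 1  # context parallel size

    def dp(self) -> int:
        """Data-parallel size left over after TP, PP, and CP take their share."""
        model_parallel = self.tp * self.pp * self.cp
        if self.world_size % model_parallel != 0:
            raise ValueError(
                f"world_size={self.world_size} is not divisible by "
                f"tp*pp*cp={model_parallel}"
            )
        return self.world_size // model_parallel

    def tokens_per_rank(self, seq_len: int) -> int:
        """Sequence tokens each CP rank holds (simplified: even split only)."""
        if seq_len % self.cp != 0:
            raise ValueError(f"seq_len={seq_len} is not divisible by cp={self.cp}")
        return seq_len // self.cp


# Same 64-GPU job, before and after enabling context parallelism.
short_ctx = ParallelLayout(world_size=64, tp=2, pp=4, cp=1)
long_ctx = ParallelLayout(world_size=64, tp=2, pp=4, cp=4)

print(short_ctx.dp(), short_ctx.tokens_per_rank(4096))   # 8 4096
print(long_ctx.dp(), long_ctx.tokens_per_rank(32768))    # 2 8192
```

In this illustrative layout, raising CP from 1 to 4 keeps the per-rank sequence slice at 8192 tokens for a 32K sequence but cuts DP from 8 to 2, which is why the dispatcher and PP choices matter more at long context.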

Rounded Scaling Patterns

Repository

nvidia/skills

First Seen

May 29, 2026
