nemo-mbridge-perf-moe-vlm-training
MoE VLM Training
Stable docs: @docs/training/moe-optimization.md
Card: @skills/nemo-mbridge-perf-moe-vlm-training/card.yaml
FSDP vs 3D Parallel
| Approach | Strength | Best fit |
|---|---|---|
| FSDP | Simplest path to a working multimodal run | first bring-up, memory-first tuning, awkward PP boundaries |
| 3D parallel | Higher ceiling after tuning | stable models with a clean PP layout and time for deeper sweeps |
For MoE VLMs, the practical workflow is usually:
- get the first reliable run with FSDP
- stabilize real-data input, recompute, and memory behavior
- move to 3D parallel only if the throughput headroom is worth the extra work
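As a rough illustration of what a "clean PP layout" could mean in practice, the sketch below checks whether a candidate 3D layout divides evenly and falls back to FSDP otherwise. It is a minimal sketch under stated assumptions, not a Megatron-Bridge API: the names `Layout3D`, `plan_3d_layout`, and `suggest_approach` are hypothetical, "clean" is assumed to mean that the decoder layer count splits evenly across pipeline stages, and context and expert parallelism are ignored for simplicity.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Layout3D:
    """Hypothetical container for a tensor/pipeline/data parallel layout."""
    tp: int
    pp: int
    dp: int
    layers_per_stage: int


def plan_3d_layout(world_size: int, tp: int, pp: int, num_layers: int) -> Optional[Layout3D]:
    """Return a layout if everything divides evenly, otherwise None.

    Assumption: world_size = tp * pp * dp (context and expert parallelism
    are left out), and a clean PP layout means num_layers % pp == 0.
    """
    if min(world_size, tp, pp, num_layers) < 1:
        raise ValueError("all sizes must be positive integers")
    if world_size % (tp * pp) != 0:
        return None  # GPUs cannot be split into whole data-parallel replicas
    if num_layers % pp != 0:
        return None  # awkward PP boundary: stages would hold uneven layer counts
    return Layout3D(tp=tp, pp=pp, dp=world_size // (tp * pp),
                    layers_per_stage=num_layers // pp)


def suggest_approach(world_size: int, tp: int, pp: int, num_layers: int,
                     first_bring_up: bool) -> str:
    """Mirror the workflow above: FSDP first, 3D parallel only when the layout is clean."""
    if first_bring_up:
        return "fsdp"
    layout = plan_3d_layout(world_size, tp, pp, num_layers)
    return "3d_parallel" if layout is not None else "fsdp"


if __name__ == "__main__":
    print(plan_3d_layout(world_size=64, tp=4, pp=4, num_layers=48))
    print(suggest_approach(64, 4, 4, 46, first_bring_up=False))
```

A passing check only says the layout is a candidate; whether the throughput headroom justifies the move still has to come from measured runs.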