MoE VLM Training

Stable docs: @docs/training/moe-optimization.md
Card: @skills/nemo-mbridge-perf-moe-vlm-training/card.yaml

FSDP vs 3D Parallel

Approach	Strength	Best fit
FSDP	Simplest path to a working multimodal run	first bring-up, memory-first tuning, awkward PP boundaries
3D parallel	Higher ceiling after tuning	stable models with a clean PP layout and time for deeper sweeps
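
As a rough illustration of what a "clean PP layout" check could look like before committing to 3D parallel, the sketch below derives the data-parallel size from the standard relation world_size = TP x PP x DP and rejects layouts where the layers do not split evenly across pipeline stages. `Layout3D` and `plan_3d_layout` are hypothetical names, not part of any library, and the even-split rule is an assumed stand-in for "clean"; expert parallelism and the vision encoder are left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Layout3D:
    """Hypothetical container for a 3D-parallel layout (not a library type)."""
    tensor_parallel: int
    pipeline_parallel: int
    data_parallel: int


def plan_3d_layout(world_size: int, tp: int, pp: int, num_layers: int) -> Layout3D:
    """Derive the data-parallel size and check for an even pipeline split.

    Assumption: a "clean PP layout" means num_layers divides evenly by pp.
    """
    if tp < 1 or pp < 1:
        raise ValueError("tp and pp must be positive")
    # world_size = tp * pp * dp, so tp * pp must divide the GPU count.
    if world_size % (tp * pp) != 0:
        raise ValueError("world_size must be divisible by tp * pp")
    # Uneven stages are treated here as an "awkward PP boundary".
    if num_layers % pp != 0:
        raise ValueError("num_layers does not split evenly across pp stages")
    return Layout3D(tp, pp, world_size // (tp * pp))


if __name__ == "__main__":
    print(plan_3d_layout(world_size=64, tp=2, pp=4, num_layers=48))
```

If the check fails for every pipeline size worth trying, that matches the "awkward PP boundaries" case in the table, where FSDP is the suggested starting point.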

For MoE VLMs, the practical workflow is usually:
