MindSpeed-MM VLM (Vision-Language Model) Training

This Skill guides users through training multimodal understanding (VLM) models on Huawei Ascend NPU using MindSpeed-MM. It uses Qwen2.5VL-3B as the flagship example and covers the end-to-end fine-tuning workflow.

Prerequisites

Critical: For most VLMs (Qwen2.5VL, Qwen2VL, InternVL, GLM4V, DeepSeekVL2), follow the manual install flow below. Do NOT use bash scripts/install.sh: the official MindSpeed-MM docs state that it fully supports only Qwen3/Qwen3.5. (For Qwen3VL / Qwen3.5, use the one-click install plus bash examples/qwen3_5/install_extensions.sh.)

Step P1: Clone repositories

git clone https://gitcode.com/Ascend/MindSpeed-MM.git /root/workspace/MindSpeed-MM
git clone https://github.com/NVIDIA/Megatron-LM.git /root/workspace/Megatron-LM
cd /root/workspace/Megatron-LM && git checkout core_v0.12.1
cp -r megatron /root/workspace/MindSpeed-MM/
cd /root/workspace/MindSpeed-MM
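
Before moving on, it can help to confirm that the copy step actually placed the megatron package inside the MindSpeed-MM checkout. The following is a minimal sketch, not part of MindSpeed-MM: the function name check_p1_layout is hypothetical, and it assumes the default /root/workspace paths used above unless a different directory is passed.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not shipped with MindSpeed-MM): sanity-check the
# directory layout produced by Step P1.
check_p1_layout() {
    # Assumed default matches the clone target used in Step P1.
    local mm_dir="${1:-/root/workspace/MindSpeed-MM}"

    # `cp -r megatron ...` should have created a megatron/ directory here.
    if [ ! -d "${mm_dir}/megatron" ]; then
        echo "missing: ${mm_dir}/megatron"
        return 1
    fi
    echo "ok: ${mm_dir}/megatron"
}
```

Calling check_p1_layout with no arguments checks the default path; pass another directory if the repositories were cloned elsewhere. A non-zero return usually means the cp step ran from the wrong working directory.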

Step P2: Install PyTorch + torch_npu