sglang-deepseek-v3-r1-optimization
SGLang DeepSeek V3/R1 Optimization
Overview
This skill covers the DeepSeek V3/R1 optimization ladder that is active in SGLang main. It intentionally excludes the V3.1 parser delta and the V3.2 DSA/NSA sparse-attention stack, which have separate skills.
Current-main snapshot:
- SGLang origin/main: 929e00eea on 2026-04-21
- sgl-cookbook origin/main: 8ec4d03 on 2026-04-21
- active runtime entry: python/sglang/srt/models/deepseek_v2.py
- DeepSeek V3/R1 entry class: DeepseekV3ForCausalLM
- NextN/MTP entry class: DeepseekV3ForCausalLMNextN
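Because the snapshot is pinned to specific commits, it can help to confirm that a local checkout still defines the two entry classes before relying on the notes here. The following is a minimal sketch, not part of SGLang: the helper names (`top_level_classes`, `locate_classes`, `read_models_dir`) are hypothetical, and the `deepseek*.py` glob is an assumption about which model files are worth scanning. Only the class names and the models directory come from the snapshot above.

```python
import ast
from pathlib import Path

# Class names copied from the snapshot list above.
SNAPSHOT_CLASSES = ("DeepseekV3ForCausalLM", "DeepseekV3ForCausalLMNextN")


def top_level_classes(source: str) -> set[str]:
    """Return the names of classes defined at module level in Python source."""
    tree = ast.parse(source)
    return {node.name for node in tree.body if isinstance(node, ast.ClassDef)}


def locate_classes(sources: dict[str, str], wanted=SNAPSHOT_CLASSES) -> dict[str, list[str]]:
    """Map each wanted class name to the sorted list of files that define it."""
    located = {name: [] for name in wanted}
    for filename in sorted(sources):
        defined = top_level_classes(sources[filename])
        for name in wanted:
            if name in defined:
                located[name].append(filename)
    return located


def read_models_dir(repo_root: str) -> dict[str, str]:
    """Read model sources from a checkout (hypothetical helper).

    Assumption: the relevant files match deepseek*.py under the models
    directory named in the snapshot.
    """
    models = Path(repo_root) / "python/sglang/srt/models"
    return {p.name: p.read_text(encoding="utf-8") for p in models.glob("deepseek*.py")}
```

A typical call would be `locate_classes(read_models_dir("/path/to/sglang"))`. An empty list for a class name means that no scanned file defines it at that checkout, which suggests the snapshot has drifted from the local tree.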
The historical evidence lives in:
- references/pr-history.md: chronological PR evidence and code-level notes
- references/playbook.md: investigation order, symptom mapping, validation commands