sglang-kimi-k2-k25-optimization
# SGLang Kimi K2/K2.5 Optimization

## Overview
This skill is an optimization ladder: identify which stage the current code is at, apply the next missing optimization, and move deeper only once the earlier stage is satisfied.
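As a mental model only, the ladder discipline can be sketched as below. The `Stage` objects and their pass checks are hypothetical placeholders, not code from this skill; the real stages and their evidence live in the reference files listed further down.

```python
# Hedged sketch of the ladder discipline; stage names and checks are
# hypothetical placeholders, not part of the skill itself.
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Stage:
    name: str
    is_satisfied: Callable[[], bool]  # e.g. "benchmark matches the merged PR's numbers"

def next_missing_stage(ladder: list[Stage]) -> Optional[Stage]:
    """Return the first unsatisfied stage; never skip ahead to a deeper one."""
    for stage in ladder:
        if not stage.is_satisfied():
            return stage
    return None  # every stage satisfied: nothing left to apply
```

The point of the sketch is the ordering rule: the deeper optimizations assume the shallower ones already hold, so applying them out of order invalidates the historical benchmark evidence.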
**Current-main snapshot:**
This skill was refreshed against SGLang origin/main commit c122d343a on 2026-04-21. Since the older PR ladder was written, current main has added a Kimi-K2.5 usage doc, parser and OpenAI-serving tests for kimi_k2, Kimi-K2.5 LoRA regression coverage, AMD/GB300 validation lanes, and a Kimi-K2-Thinking stress test. Treat those as part of the active validation surface, not as optional CI trivia.
Active open PRs also point to the most likely next skill updates: W4AFP8 loading, W4A16 DeepEP low-latency, Kimi-K2.5 multimodal processor fixes, ROCm fused QK RMSNorm, and JIT migration of the older K2 fused gate path.
One important gap with no open PR is Kimi-K2-Thinking with DeepEP plus int4/Marlin: #13789 tried to support it but was closed unmerged after hitting an illegal memory access in the fused_marlin_moe path. Do not mark that combination as mainline-supported just because the generic Marlin JIT work in #19181 landed.
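If you script pre-flight checks, a guard along the lines of the hedged sketch below keeps that combination from being silently treated as supported. Every parameter name here is an illustrative placeholder, not a real SGLang `ServerArgs` field; adapt it to whatever config object you actually inspect.

```python
# Hedged sketch only: a pre-flight guard for the known-bad combination.
# model_id, moe_backend, and quantization are hypothetical names, not
# real SGLang attributes.
def check_k2_thinking_marlin_gap(model_id: str, moe_backend: str, quantization: str) -> None:
    if (
        "Kimi-K2-Thinking" in model_id
        and moe_backend == "deepep"
        and quantization in {"int4", "marlin"}
    ):
        raise RuntimeError(
            "Kimi-K2-Thinking with DeepEP + int4/Marlin is not mainline-supported: "
            "#13789 was closed unmerged after an illegal memory access in the "
            "fused_marlin_moe path; #19181 does not cover this combination."
        )
```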
The historical evidence for every stage lives in:
- `references/pr-history.md`: merged PR evidence, benchmark tables, key code
- `references/playbook.md`: symptom mapping, commands, validation order
## Before You Change Anything
Record the exact serving shape first:
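The skill does not fix a schema for this, but a minimal sketch, assuming a typical single-node SGLang launch, might capture it as below. Every field name is an illustrative choice, and the example values (model id, backend, launch command) are assumptions to replace with your own.

```python
# Hedged sketch of recording the serving shape before touching code.
# All field names and example values are assumptions, not a schema this
# skill defines; extend with whatever flags your launch command uses.
import json
import subprocess
from dataclasses import dataclass, asdict

@dataclass
class ServingShape:
    model: str             # exact checkpoint path or HF id
    tp_size: int           # tensor-parallel degree
    quantization: str      # whatever the launch actually used
    attention_backend: str
    launch_command: str    # the verbatim server command line
    sglang_commit: str     # pin the code you benchmarked against

def current_commit() -> str:
    return subprocess.check_output(
        ["git", "rev-parse", "--short", "HEAD"], text=True
    ).strip()

shape = ServingShape(
    model="moonshotai/Kimi-K2-Instruct",
    tp_size=8,
    quantization="fp8",
    attention_backend="flashinfer",
    launch_command="python -m sglang.launch_server ...",
    sglang_commit=current_commit(),
)
print(json.dumps(asdict(shape), indent=2))
```

Persisting this JSON next to your benchmark results makes a later regression attributable to a specific serving shape rather than a vague sense that the server got slower.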