sglang-prod-incident-triage


SGLang Serving Debug

Overview

Use this skill to turn a live serving problem into a debug path you can replay.

Use one loop:

  • collect a baseline bundle
  • save the failing request or crash dump
  • replay on a clean target
  • only then switch tools
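The first three steps of the loop can be sketched in Python as below. This is a minimal illustration, not part of the skill itself: the function names, the bundle layout (`baseline.json`, `request.json`), and the `{"path": ..., "body": ...}` request shape are all assumptions made for the example. It only uses the standard library and builds the replay request without sending it.

```python
import json
import platform
import sys
import time
import urllib.request
from pathlib import Path


def collect_baseline(server_args):
    """Record enough context to rebuild the environment later (hypothetical fields)."""
    return {
        "captured_at": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "server_args": list(server_args),  # launch flags, copied verbatim
    }


def save_bundle(bundle_dir, baseline, failing_request):
    """Write the baseline and the failing request side by side in one directory."""
    bundle_dir = Path(bundle_dir)
    bundle_dir.mkdir(parents=True, exist_ok=True)
    (bundle_dir / "baseline.json").write_text(json.dumps(baseline, indent=2))
    (bundle_dir / "request.json").write_text(json.dumps(failing_request, indent=2))
    return bundle_dir


def build_replay(bundle_dir, target_base_url):
    """Rebuild the saved request against a clean target; the caller decides when to send."""
    saved = json.loads((Path(bundle_dir) / "request.json").read_text())
    return urllib.request.Request(
        url=target_base_url.rstrip("/") + saved["path"],
        data=json.dumps(saved["body"]).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

To replay, pass the result to `urllib.request.urlopen(req, timeout=60)` once the clean target is up. Keeping the build step separate from the send step makes it possible to inspect exactly what will be replayed before touching the target.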

Do not start with profiling.

This skill should hand off to more focused skills instead of re-implementing them:

  • debug-cuda-crash when replay plus coredump points to a CUDA crash path
  • debug-distributed-hang when the problem is clearly a TP/PP/DP/EP hang
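The hand-off rule above can be written as a small decision function. This is a sketch under stated assumptions: the `Findings` fields are hypothetical names for what the replay step produced, and only the two skill names come from the list above.

```python
from dataclasses import dataclass
from typing import Optional

PARALLEL_MODES = {"TP", "PP", "DP", "EP"}


@dataclass
class Findings:
    """Hypothetical summary of what the replay step produced."""
    replay_reproduces: bool = False
    coredump_shows_cuda: bool = False
    is_hang: bool = False
    parallel_mode: Optional[str] = None  # e.g. "TP"; None if single-device


def pick_followup(f: Findings) -> Optional[str]:
    """Return the focused skill to hand off to, or None to stay in the triage loop."""
    if f.replay_reproduces and f.coredump_shows_cuda:
        return "debug-cuda-crash"
    if f.is_hang and f.parallel_mode in PARALLEL_MODES:
        return "debug-distributed-hang"
    return None
```

Returning `None` when neither condition holds keeps the default in the loop: gather more evidence before switching tools.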

Installs: 33
GitHub Stars: 272
First Seen: Apr 21, 2026