# Distributed Training Log Debugger
You are debugging a distributed training job failure. The user will provide one or two directories:

- `$ARGUMENTS[0]` (required): worker logs (stderr from SLURM/torchrun)
- `$ARGUMENTS[1]` (optional): AIStore daemon logs (proxy/target tarballs or extracted dirs)
Your goal: find the root cause, not just the symptom. There are often cascading failures — one root cause triggers many downstream errors. Trace backwards from the final crash to the original trigger.
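Tracing backwards usually starts with a timestamp-ordered scan of every rank's errors: the earliest error is the root-cause candidate, and later errors on other ranks are typically downstream symptoms. A minimal sketch, assuming one stderr file per rank prefixed with sortable ISO-8601 timestamps (the `/tmp/demo_logs` paths and log wording here are made up for illustration):

```shell
# Fabricated per-rank stderr files; real filenames and formats will differ.
mkdir -p /tmp/demo_logs
printf '2024-05-01T10:00:03 rank3: CUDA error: out of memory\n' > /tmp/demo_logs/rank3.err
printf '2024-05-01T10:00:07 rank0: NCCL timeout waiting for rank 3\n' > /tmp/demo_logs/rank0.err
printf '2024-05-01T10:00:09 rank1: NCCL timeout waiting for rank 3\n' > /tmp/demo_logs/rank1.err

# Merge errors from ALL ranks, sort by timestamp, and look at the earliest:
# that is the candidate root cause; the NCCL timeouts are its symptoms.
grep -h -i -E 'error|timeout|abort' /tmp/demo_logs/*.err | sort | head -n 1
# → 2024-05-01T10:00:03 rank3: CUDA error: out of memory
```

Because the timestamps sort lexicographically, the rank 3 OOM surfaces first even though more ranks reported the (secondary) NCCL timeout.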
## CRITICAL: Verification discipline
You MUST double-check every conclusion before presenting it to the user. Distributed training failures have cascading effects that make it easy to mistake a symptom for the root cause. Follow these rules:
- Check ALL ranks, not a sample. Do not look at 5 ranks and assume the other 123 are the same. Use `sort -u` on the full output to find outliers. A single outlier rank can be the entire root cause.
- Verify NCCL state for EVERY rank. Extract `last enqueued work` and `last completed work` for all ranks and check for ANY rank that differs. A rank with `enqueued == completed` (no pending ops) is fundamentally different from ranks with `enqueued > completed` (ops pending): it means that rank never entered the stuck collective.
- Distinguish "First PG on this rank to signal dumping" from "Observed flight recorder dump signal from another rank". The first is the initiator (its own watchdog fired). The second was notified (it may not even be in the collective). These are DIFFERENT failure modes.
- Before stating a root cause, re-verify it against the raw data. Re-read the actual log lines. Do not rely on earlier summaries you wrote — they may have been wrong.
- If your analysis changes during the investigation, explicitly state what was wrong before and why. Do not silently shift conclusions.
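The all-ranks NCCL check above can be sketched as a small pipeline. The `/tmp/demo_nccl` files and the exact flight-recorder wording are illustrative assumptions (the real phrasing varies by PyTorch version):

```shell
# Fabricated watchdog lines: ranks 0-2 have one op pending, rank 3 does not.
mkdir -p /tmp/demo_nccl
for r in 0 1 2; do
  printf 'rank %s: last enqueued work: 512, last completed work: 511\n' "$r" > "/tmp/demo_nccl/rank${r}.err"
done
printf 'rank 3: last enqueued work: 511, last completed work: 511\n' > /tmp/demo_nccl/rank3.err

# Strip the rank prefix and deduplicate: each distinct (enqueued, completed)
# pair prints once, so an outlier is visible even across hundreds of ranks.
sed -n 's/^rank [0-9]*: \(last enqueued work.*\)/\1/p' /tmp/demo_nccl/*.err | sort -u

# Then identify which rank produced the outlier (enqueued == completed) line.
grep -l 'enqueued work: 511, last completed work: 511' /tmp/demo_nccl/*.err
# → /tmp/demo_nccl/rank3.err
```

Here `sort -u` collapses 128 per-rank lines into two distinct states, and the follow-up `grep -l` names the file (rank) that never entered the stuck collective.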