debug-training-logs

Distributed Training Log Debugger

You are debugging a distributed training job failure. The user will provide one or two directories:

  • Worker logs (stderr from SLURM/torchrun): $ARGUMENTS[0] (required)
  • AIStore daemon logs (proxy/target tarballs or extracted dirs): $ARGUMENTS[1] (optional)

Your goal: find the root cause, not just the symptom. There are often cascading failures — one root cause triggers many downstream errors. Trace backwards from the final crash to the original trigger.
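
A fast way to orient before reading anything in depth: pull the first failure signal out of each rank's stderr and compare timestamps; the earliest one is usually closest to the trigger. A minimal sketch, assuming one stderr file per rank (the rank_*.err glob and the error keywords are hypothetical; substitute whatever $ARGUMENTS[0] actually contains):

```bash
#!/usr/bin/env bash
# Print each rank's FIRST failure signal so the cascade can be traced back
# to whichever rank fell over earliest.
WORKER_LOG_DIR="$1"
for f in "$WORKER_LOG_DIR"/rank_*.err; do
  # -m1: stop at the first match, so we see the initial error, not the cascade.
  # The extra /dev/null argument forces grep to prefix each hit with its file.
  grep -m1 -nE 'ERROR|Traceback|Watchdog caught|NCCL (error|timeout)' "$f" /dev/null
done
```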

CRITICAL: Verification discipline

You MUST double-check every conclusion before presenting it to the user. Distributed training failures have cascading effects that make it easy to mistake a symptom for the root cause. Follow these rules:

  1. Check ALL ranks, not a sample. Do not look at 5 ranks and assume the other 123 are the same. Use sort -u on the full output to find outliers; a single outlier rank can be the entire root cause (see the extraction sketch after this list).
  2. Verify NCCL state for EVERY rank. Extract the last enqueued work and last completed work for all ranks and check for ANY rank that differs. A rank with enqueued == completed (no pending ops) is fundamentally different from ranks with enqueued > completed (ops pending): it means that rank never entered the stuck collective.
  3. Distinguish "First PG on this rank to signal dumping" from "Observed flight recorder dump signal from another rank". The first is the initiator (its own watchdog fired); the second was merely notified and may not even be in the collective. These are DIFFERENT failure modes (see the classification sketch after this list).
  4. Before stating a root cause, re-verify it against the raw data. Re-read the actual log lines. Do not rely on earlier summaries you wrote — they may have been wrong.
  5. If your analysis changes during the investigation, explicitly state what was wrong before and why. Do not silently shift conclusions.
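
For rules 1 and 2, a minimal extraction sketch. It assumes the PyTorch watchdog wording ("last enqueued NCCL work: N, last completed NCCL work: M") and the same hypothetical rank_*.err layout as above; verify both against the raw log lines before trusting the output:

```bash
# Collapse every rank's NCCL progress counters with sort -u so outliers stand out.
grep -h 'last enqueued NCCL work' "$WORKER_LOG_DIR"/rank_*.err \
  | sed -E 's/.*[Rr]ank ?([0-9]+).*enqueued NCCL work: ([0-9]+).*completed NCCL work: ([0-9]+).*/rank=\1 enqueued=\2 completed=\3/' \
  | sort -u
# Any rank reporting enqueued == completed never entered the stuck collective;
# start the root-cause hunt with that rank.
```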
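
For rule 3, a sketch that separates initiators from observers. The two search strings are the exact phrases quoted above; the log glob is the same assumption:

```bash
echo '--- initiators (own watchdog fired) ---'
grep -l 'First PG on this rank to signal dumping' "$WORKER_LOG_DIR"/rank_*.err
echo '--- observers (merely notified; may not be in the collective) ---'
grep -l 'Observed flight recorder dump signal from another rank' "$WORKER_LOG_DIR"/rank_*.err
```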

Phase 0: Obtain logs
