fr-analysis
Installation
SKILL.md
Skill: fr_analysis
Analyze PyTorch NCCL flight-recorder (FR) dumps to identify the collective operation hang
and isolate the ranks responsible, using CollectiveAnalyzer.
Script: scripts/fr_attribution.py → attribution/trace_analyzer/fr_attribution.py
What it does
- Loads all FR dump files matching a glob pattern under
--fr-path. - Parses each dump into
Collectiverecords (op type, ranks, process group, timing, state). - Groups collectives by process group and sequence ID across ranks to detect mismatches.
- Identifies the wavefront — the process group boundary where collectives diverge — and returns the missing ranks at that boundary as the root-cause suspects.
- Optionally runs an LLM pass (
--llm-analyze) over the structured findings for a human-readable summary.