fr-analysis

Installation
SKILL.md

Skill: fr_analysis

Analyze PyTorch NCCL flight-recorder (FR) dumps to identify the collective operation hang and isolate the ranks responsible, using CollectiveAnalyzer.

Script: scripts/fr_attribution.pyattribution/trace_analyzer/fr_attribution.py


What it does

  1. Loads all FR dump files matching a glob pattern under --fr-path.
  2. Parses each dump into Collective records (op type, ranks, process group, timing, state).
  3. Groups collectives by process group and sequence ID across ranks to detect mismatches.
  4. Identifies the wavefront — the process group boundary where collectives diverge — and returns the missing ranks at that boundary as the root-cause suspects.
  5. Optionally runs an LLM pass (--llm-analyze) over the structured findings for a human-readable summary.
Installs
1
GitHub Stars
298
First Seen
9 days ago
fr-analysis — nvidia/nvidia-resiliency-ext