perf-workload-profiling
Installation
SKILL.md
Workload Profiling
Quick Reference
Pick ONE path based on the workload type:
| Workload | Approach | Section |
|---|---|---|
| Training loop | Manual torch.cuda.synchronize() + time.perf_counter() with warmup |
Loop Workloads — Manual Timing |
| Single kernel or op | Write CUDA event benchmark (pre-allocate, warmup, event pairs) | Non-Loop Workloads — CUDA Event Benchmarking |
| Add timeline labels for nsys | Use @nvtx.annotate decorator or context manager |
NVTX Reference |