perf-workload-profiling

Installation
SKILL.md

Workload Profiling

Quick Reference

Pick ONE path based on the workload type:

Workload Approach Section
Training loop Manual torch.cuda.synchronize() + time.perf_counter() with warmup Loop Workloads — Manual Timing
Single kernel or op Write CUDA event benchmark (pre-allocate, warmup, event pairs) Non-Loop Workloads — CUDA Event Benchmarking
Add timeline labels for nsys Use @nvtx.annotate decorator or context manager NVTX Reference

Principles

Installs
1
GitHub Stars
13.9K
First Seen
May 8, 2026
perf-workload-profiling — nvidia/tensorrt-llm