perf-workload-profiling


Workload Profiling

Pick ONE path based on the workload type:

| Workload | Approach | Section |
| --- | --- | --- |
| Training loop | Manual `torch.cuda.synchronize()` + `time.perf_counter()` with warmup | Loop Workloads — Manual Timing |
| Single kernel or op | Write a CUDA event benchmark (pre-allocate, warmup, event pairs) | Non-Loop Workloads — CUDA Event Benchmarking |
| Add timeline labels for nsys | Use the `@nvtx.annotate` decorator or context manager | NVTX Reference |


First Seen

May 8, 2026


perf-workload-profiling — nvidia/tensorrt-llm