Analyze Kernel Bottleneck
Systematically identify whether a GPU kernel is compute-bound, memory-bound, or latency-bound by measuring baseline performance, classifying on the roofline, computing occupancy and compute/load ratio per tile, inspecting SASS instruction mix and stall codes, checking the shared memory cliff, and applying a decision matrix to select the right optimization strategy.
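The roofline classification step can be sketched with a small arithmetic-intensity calculation. This is a minimal illustration, not part of the skill itself; the peak numbers below are approximate published figures for the default target GPU (RTX 3070 Ti / GA104) and are assumptions, not measured values:

```python
# Roofline classification sketch for a GEMM kernel.
# Peak figures are approximate published specs for an RTX 3070 Ti
# (GA104 / sm_86): ~21.75 TFLOP/s FP32 and ~608 GB/s DRAM bandwidth.
PEAK_GFLOPS = 21750.0   # GFLOP/s (assumed)
PEAK_GBPS = 608.0       # GB/s   (assumed)

def gemm_arithmetic_intensity(M, N, K, bytes_per_elem=4):
    """FLOPs per byte of DRAM traffic, assuming each matrix moves once."""
    flops = 2.0 * M * N * K                        # one FMA = 2 FLOPs
    traffic = bytes_per_elem * (M * K + K * N + M * N)
    return flops / traffic

def classify(ai):
    """Compare arithmetic intensity against the machine's ridge point."""
    ridge = PEAK_GFLOPS / PEAK_GBPS                # FLOP/byte where roofs meet
    return "compute-bound" if ai >= ridge else "memory-bound"

ai = gemm_arithmetic_intensity(4096, 4096, 4096)
print(f"AI = {ai:.1f} FLOP/byte -> {classify(ai)}")
```

A 4096-cubed FP32 GEMM lands far to the right of the roughly 35.8 FLOP/byte ridge point, so it should be compute-bound; if measured throughput is still far below peak, the roofline alone cannot explain it, which is why the later occupancy and stall-code steps exist.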
When to Use
- Before optimizing any CUDA kernel -- establish baseline and classify bottleneck type
- After writing a first working version of a kernel to identify the optimization path
- When a kernel underperforms expectations relative to theoretical peak
- When deciding between cp.async, larger tiles, or algorithmic restructuring
Inputs
- Required: Compiled kernel (`.cubin` or `.cu` source with build command)
- Required: Benchmark harness that launches the kernel with CUDA event timing
- Required: Problem dimensions (e.g., M, N, K for GEMM; seq_len, heads, head_dim for attention)
- Optional: Target GPU architecture (default: GA104 / sm_86 / RTX 3070 Ti)
- Optional: Expected peak utilization percentage for comparison
- Optional: Prior profiling data (Nsight Compute reports)
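The occupancy and shared-memory-cliff checks mentioned in the description can be approximated with a simplified per-SM resource model. The sm_86 limits below come from NVIDIA's compute-capability tables; the model ignores register-allocation granularity, so treat its output as an estimate:

```python
# Simplified occupancy model for sm_86 (GA104). Limits are taken from
# NVIDIA's compute-capability tables; register-allocation granularity
# is ignored, so results are estimates rather than exact occupancy.
MAX_WARPS_PER_SM = 48
MAX_THREADS_PER_SM = 1536
MAX_BLOCKS_PER_SM = 16
REGS_PER_SM = 65536
SMEM_PER_SM = 100 * 1024    # opt-in maximum usable shared memory per SM

def occupancy(threads_per_block, regs_per_thread, smem_per_block):
    """Return (resident blocks per SM, active warps / max warps)."""
    limits = [
        MAX_THREADS_PER_SM // threads_per_block,
        REGS_PER_SM // (threads_per_block * regs_per_thread),
        SMEM_PER_SM // smem_per_block if smem_per_block else MAX_BLOCKS_PER_SM,
        MAX_BLOCKS_PER_SM,
    ]
    blocks = min(limits)
    warps = blocks * threads_per_block // 32
    return blocks, warps / MAX_WARPS_PER_SM

# Shared memory cliff: shrinking the per-block tile from 48 KB to 32 KB
# lets a third block become resident, raising occupancy from ~33% to 50%.
print(occupancy(256, 64, 48 * 1024))   # 2 resident blocks
print(occupancy(256, 64, 32 * 1024))   # 3 resident blocks
```

This is where the cliff shows up in practice: a slightly smaller tile can add a whole resident block per SM, while a slightly larger one silently halves occupancy.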
Related skills