analyze-kernel-bottleneck


Analyze Kernel Bottleneck

Systematically determine whether a GPU kernel is compute-bound, memory-bound, or latency-bound: measure a baseline, classify the kernel on the roofline model, compute per-tile occupancy and compute-to-load ratio, inspect the SASS instruction mix and warp stall reasons, check for the shared-memory occupancy cliff, and apply a decision matrix to select the right optimization strategy.
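The roofline step above reduces to comparing a kernel's arithmetic intensity against the device's ridge point. A minimal sketch, assuming rough peaks for the default RTX 3070 Ti target (~21.7 TFLOP/s FP32, ~608 GB/s DRAM bandwidth; substitute measured values for your GPU):

```python
PEAK_FLOPS = 21.7e12  # assumed FP32 peak for RTX 3070 Ti, FLOP/s
PEAK_BW = 608e9       # assumed DRAM bandwidth, bytes/s

def classify(flops: float, bytes_moved: float) -> str:
    """Place a kernel on the roofline by its arithmetic intensity."""
    ai = flops / bytes_moved      # FLOP per byte
    ridge = PEAK_FLOPS / PEAK_BW  # ridge point, ~35.7 FLOP/B here
    return "compute-bound" if ai >= ridge else "memory-bound"

# A 4096^3 FP32 GEMM: 2*M*N*K FLOPs, three 4096^2 matrices of 4-byte floats.
print(classify(2 * 4096**3, 3 * 4096 * 4096 * 4))  # compute-bound
# Elementwise add of n floats: n FLOPs, 12n bytes moved.
print(classify(4096.0, 12 * 4096.0))               # memory-bound
```

Latency-bound kernels sit well below the roofline at either intensity, which is why the later occupancy and stall-reason steps are still needed.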

When to Use

  • Before optimizing any CUDA kernel -- establish baseline and classify bottleneck type
  • After writing a first working version of a kernel to identify the optimization path
  • When a kernel underperforms expectations relative to theoretical peak
  • When deciding between cp.async, larger tiles, or algorithmic restructuring
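The cp.async-versus-larger-tiles decision usually hinges on the shared-memory occupancy cliff. A sketch of the per-SM occupancy arithmetic, assuming sm_86 limits (48 warps, 100 KiB shared memory, 65,536 registers, 16 resident blocks per SM; verify against the CUDA occupancy tables for your architecture):

```python
import math

MAX_WARPS = 48           # assumed warps per SM (sm_86)
SMEM_BYTES = 100 * 1024  # assumed shared memory per SM
REGS = 65536             # assumed 32-bit registers per SM
MAX_BLOCKS = 16          # assumed resident blocks per SM

def occupancy(threads_per_block: int, smem_per_block: int,
              regs_per_thread: int) -> float:
    """Fraction of the SM's warp slots filled by resident blocks."""
    warps = math.ceil(threads_per_block / 32)
    by_warps = MAX_WARPS // warps
    by_smem = SMEM_BYTES // smem_per_block if smem_per_block else MAX_BLOCKS
    by_regs = REGS // (threads_per_block * regs_per_thread)
    blocks = min(by_warps, by_smem, by_regs, MAX_BLOCKS)
    return blocks * warps / MAX_WARPS

# 256 threads, 48 KiB smem, 64 regs/thread: smem caps us at 2 blocks/SM.
print(round(occupancy(256, 48 * 1024, 64), 3))  # 0.333
# Halving the tile's smem doubles resident blocks -- the cliff.
print(round(occupancy(256, 24 * 1024, 64), 3))  # 0.667
```

When a small shared-memory reduction yields a large occupancy jump like this, shrinking tiles can beat cp.async; when occupancy is already warp- or register-limited, it cannot.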

Inputs

  • Required: Compiled kernel (.cubin or .cu source with build command)
  • Required: Benchmark harness that launches the kernel with CUDA event timing
  • Required: Problem dimensions (e.g., M, N, K for GEMM; seq_len, heads, head_dim for attention)
  • Optional: Target GPU architecture (default: GA104 / sm_86 / RTX 3070 Ti)
  • Optional: Expected peak utilization percentage for comparison
  • Optional: Prior profiling data (Nsight Compute reports)
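Given the event-timed baseline and problem dimensions from the inputs above, comparing against the optional expected-peak figure is one division. A sketch, again assuming the ~21.7 TFLOP/s FP32 estimate for the default target rather than a vendor-guaranteed number:

```python
PEAK_FLOPS = 21.7e12  # assumed FP32 peak for RTX 3070 Ti, FLOP/s

def utilization(flops: float, elapsed_ms: float) -> float:
    """Achieved fraction of assumed peak from a CUDA-event timing."""
    achieved = flops / (elapsed_ms * 1e-3)  # FLOP/s
    return achieved / PEAK_FLOPS

# A 4096^3 GEMM (2*M*N*K FLOPs) timed at 10 ms sits near 63% of peak.
print(f"{utilization(2 * 4096**3, 10.0):.1%}")
```

A low number here says only that headroom exists; the roofline and stall-reason steps say where it went.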