Analyze Kernel Bottleneck
Systematically identify whether a GPU kernel is compute-bound, memory-bound, or latency-bound by measuring baseline performance, classifying on the roofline, computing occupancy and compute/load ratio per tile, inspecting SASS instruction mix and stall codes, checking the shared memory cliff, and applying a decision matrix to select the right optimization strategy.
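The roofline classification step can be sketched with a small arithmetic-intensity calculation. This is a minimal illustration, not part of the skill itself; the peak numbers below are approximate published figures for the default target GPU (RTX 3070 Ti / GA104) and are assumptions, not measured values:

```python
# Roofline classification sketch for a GEMM kernel.
# Peak figures are approximate published specs for an RTX 3070 Ti
# (GA104 / sm_86): ~21.75 TFLOP/s FP32 and ~608 GB/s DRAM bandwidth.
PEAK_GFLOPS = 21750.0   # GFLOP/s (assumed)
PEAK_GBPS = 608.0       # GB/s   (assumed)

def gemm_arithmetic_intensity(M, N, K, bytes_per_elem=4):
    """FLOPs per byte of DRAM traffic, assuming each matrix moves once."""
    flops = 2.0 * M * N * K                        # one FMA = 2 FLOPs
    traffic = bytes_per_elem * (M * K + K * N + M * N)
    return flops / traffic

def classify(ai):
    """Compare arithmetic intensity against the machine's ridge point."""
    ridge = PEAK_GFLOPS / PEAK_GBPS                # FLOP/byte where roofs meet
    return "compute-bound" if ai >= ridge else "memory-bound"

ai = gemm_arithmetic_intensity(4096, 4096, 4096)
print(f"AI = {ai:.1f} FLOP/byte -> {classify(ai)}")
```

A 4096-cubed FP32 GEMM lands far to the right of the roughly 35.8 FLOP/byte ridge point, so it should be compute-bound; if measured throughput is still far below peak, the roofline alone cannot explain it, which is why the later occupancy and stall-code steps exist.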
When to Use
- Before optimizing any CUDA kernel -- establish baseline and classify bottleneck type
- After writing a first working version of a kernel to identify the optimization path
- When a kernel underperforms expectations relative to theoretical peak
- When deciding between cp.async, larger tiles, or algorithmic restructuring
Inputs
- Required: Compiled kernel (`.cubin` or `.cu` source with build command)
- Required: Benchmark harness that launches the kernel with CUDA event timing
- Required: Problem dimensions (e.g., M, N, K for GEMM; seq_len, heads, head_dim for attention)
- Optional: Target GPU architecture (default: GA104 / sm_86 / RTX 3070 Ti)
- Optional: Expected peak utilization percentage for comparison
- Optional: Prior profiling data (Nsight Compute reports)
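The occupancy and shared-memory-cliff checks mentioned in the description can be approximated with a simplified per-SM resource model. The sm_86 limits below come from NVIDIA's compute-capability tables; the model ignores register-allocation granularity, so treat its output as an estimate:

```python
# Simplified occupancy model for sm_86 (GA104). Limits are taken from
# NVIDIA's compute-capability tables; register-allocation granularity
# is ignored, so results are estimates rather than exact occupancy.
MAX_WARPS_PER_SM = 48
MAX_THREADS_PER_SM = 1536
MAX_BLOCKS_PER_SM = 16
REGS_PER_SM = 65536
SMEM_PER_SM = 100 * 1024    # opt-in maximum usable shared memory per SM

def occupancy(threads_per_block, regs_per_thread, smem_per_block):
    """Return (resident blocks per SM, active warps / max warps)."""
    limits = [
        MAX_THREADS_PER_SM // threads_per_block,
        REGS_PER_SM // (threads_per_block * regs_per_thread),
        SMEM_PER_SM // smem_per_block if smem_per_block else MAX_BLOCKS_PER_SM,
        MAX_BLOCKS_PER_SM,
    ]
    blocks = min(limits)
    warps = blocks * threads_per_block // 32
    return blocks, warps / MAX_WARPS_PER_SM

# Shared memory cliff: shrinking the per-block tile from 48 KB to 32 KB
# lets a third block become resident, raising occupancy from ~33% to 50%.
print(occupancy(256, 64, 48 * 1024))   # 2 resident blocks
print(occupancy(256, 64, 32 * 1024))   # 3 resident blocks
```

This is where the cliff shows up in practice: a slightly smaller tile can add a whole resident block per SM, while a slightly larger one silently halves occupancy.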
Related skills