cuda-auto-tune
Installation
SKILL.md
NCU-driven iterative kernel optimization (CUDA / CUTLASS / Triton / CuTe DSL)
GATE CHECK (enforce before any optimization)
STOP — Do you have NCU profile data for this kernel?
NO → Go to Step 1. Do NOT touch any kernel code.
YES → Go to Step 2.
Hard rules — violation of any rule invalidates the entire optimization:
- NEVER change kernel code, launch config, or template parameters without NCU data.
- ALL recommendations MUST cite specific NCU metric values as evidence.
- Each iteration MUST cover at minimum: roofline, memory hierarchy, warp stalls, occupancy.
- The optimization playbook MUST match the kernel implementation type.
- After EVERY code change, re-profile and compare with
--diff. - Stop iterating when improvements plateau or metrics approach hardware ceiling.