cuda-auto-tune

SKILL.md

NCU-driven iterative kernel optimization (CUDA / CUTLASS / Triton / CuTe DSL)

GATE CHECK (enforce before any optimization)

STOP — Do you have NCU profile data for this kernel?
  NO  → Go to Step 1. Do NOT touch any kernel code.
  YES → Go to Step 2.
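
A minimal sketch of how this gate could be automated, assuming profiles are saved as Nsight Compute report files (`.ncu-rep`) in a per-project directory. The function name `gate_check`, the `profiles` directory, and the file-naming scheme are hypothetical and not part of this skill:

```python
from pathlib import Path


def gate_check(kernel_name: str, profile_dir: str = "profiles") -> str:
    """Return which step to take next for a kernel.

    Assumption: reports are stored as <profile_dir>/<kernel_name>*.ncu-rep.
    Returns "step1" (collect a profile first) or "step2" (proceed to analysis).
    """
    # Look for any saved Nsight Compute report whose name starts with the kernel name.
    reports = list(Path(profile_dir).glob(f"{kernel_name}*.ncu-rep"))
    return "step2" if reports else "step1"
```

A check like this only confirms that a report file exists. It does not verify that the report matches the current version of the kernel code, so re-profiling after every change is still required.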

Hard rules — violation of any rule invalidates the entire optimization:

  • NEVER change kernel code, launch config, or template parameters without NCU data.
  • ALL recommendations MUST cite specific NCU metric values as evidence.
  • Each iteration MUST cover at minimum: roofline, memory hierarchy, warp stalls, occupancy.
  • The optimization playbook MUST match the kernel implementation type.
  • After EVERY code change, re-profile and compare with --diff.
  • Stop iterating when improvements plateau or metrics approach the hardware ceiling.
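
The stopping rule could be expressed roughly as follows. This is a sketch under stated assumptions: the function name `should_stop`, the 2% minimum-gain threshold, and the 90%-of-peak ceiling are illustrative values chosen here, not thresholds defined by this skill. The inputs are kernel durations from successive profiles and the achieved fraction of the relevant roofline peak.

```python
def should_stop(durations_us, achieved_frac_of_peak,
                min_gain=0.02, ceiling=0.90):
    """Decide whether to stop iterating on a kernel.

    durations_us: kernel durations from successive profiles, oldest first.
    achieved_frac_of_peak: achieved throughput / roofline peak, in [0, 1].
    min_gain, ceiling: assumed thresholds (hypothetical defaults).
    """
    # Close enough to the hardware ceiling: further tuning has little headroom.
    if achieved_frac_of_peak >= ceiling:
        return True
    # A single measurement gives no trend to judge a plateau from.
    if len(durations_us) < 2:
        return False
    prev, curr = durations_us[-2], durations_us[-1]
    # Relative speedup of the latest iteration over the previous one.
    gain = (prev - curr) / prev
    return gain < min_gain
```

For example, going from 100.0 µs to 99.5 µs is a 0.5% gain, which falls below the assumed 2% threshold and would signal a plateau, while a drop from 100 µs to 80 µs would not.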

Installs: 40
GitHub Stars: 19
First Seen: Apr 21, 2026