cuda-kernel-refine


CUDA Kernel Refinement Loop

Make one change per cycle: baseline → profile → classify → optimize → verify → compare → loop. With several simultaneous changes, you cannot attribute an improvement (or a regression) to any one of them. If a change regresses performance, revert it before starting the next cycle.
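The compare step reduces to a keep-or-revert decision. Below is a minimal shell sketch. It assumes each benchmark run has already been summarized as a mean time in milliseconds. The helper name `decide` and the 5% noise band (chosen to match the coefficient-of-variation target in step 1) are assumptions, not part of the original workflow.

```shell
# decide BASELINE_MS CANDIDATE_MS [NOISE_PCT]
# Hypothetical helper: prints "keep", "revert", or "inconclusive".
# NOISE_PCT (default 5, an assumed threshold) is the band of relative change
# treated as measurement noise rather than a real effect.
decide() {
  awk -v base="$1" -v cand="$2" -v noise="${3:-5}" 'BEGIN {
    delta = 100 * (cand - base) / base   # relative change in %; negative = faster
    if (delta < -noise)      print "keep"
    else if (delta > noise)  print "revert"
    else                     print "inconclusive"
  }'
}

decide 2.40 2.10   # 12.5% faster -> keep
decide 2.40 2.60   # about 8.3% slower -> revert
```

The loop above does not say how to treat an "inconclusive" result. One option is to rerun both sides with more iterations before deciding.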

1. Establish Baseline

Discover the project's benchmark infrastructure before running anything:

  1. Check the Makefile for bench/profile targets (make bench, make profile, etc.)
  2. Check for benchmark harnesses or binaries (cargo bench, pytest-benchmark, a custom e2e_bench binary)
  3. Look at CI scripts or the README for the canonical benchmark invocation
  4. If no benchmark exists, write a minimal one that exercises the hot path and produces measurable output
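Steps 1 to 4 can be partly scripted. The helpers below are a minimal sketch; both names are hypothetical, and the sketch assumes bash and GNU `date` (for `%N`).

- `find_bench_entrypoints` applies a few assumed heuristics: Makefile targets named `bench*` or `profile*`, a Cargo `benches/` directory, and a `pytest-benchmark` dependency.
- `time_runs` is a bare-bones fallback harness for step 4 that times whole-process runs of a command.

Whole-process timing includes process startup and CUDA context creation in every run, so treat it as a coarse end-to-end signal. For kernel-level numbers, timing inside the process with CUDA events (`cudaEventRecord` / `cudaEventElapsedTime`) avoids that overhead.

```shell
# Hypothetical helpers (names and heuristics are assumptions). Requires bash and GNU date.

# find_bench_entrypoints [DIR]: print likely benchmark commands found under DIR.
find_bench_entrypoints() {
  local dir=${1:-.}
  if [ -f "$dir/Makefile" ]; then
    # Makefile targets whose names start with bench or profile, e.g. "bench:".
    grep -oE '^(bench|profile)[A-Za-z0-9_.-]*:' "$dir/Makefile" \
      | sed 's/:$//; s/^/make /' || true
  fi
  if [ -f "$dir/Cargo.toml" ] && [ -d "$dir/benches" ]; then
    echo "cargo bench"
  fi
  # pytest-benchmark is usually declared as a dependency in one of these files.
  if grep -qs 'pytest-benchmark' "$dir/pyproject.toml" "$dir"/requirements*.txt; then
    echo "pytest --benchmark-only"
  fi
}

# time_runs N CMD [ARGS...]: one untimed warm-up run, then N timed runs of CMD,
# printing wall-clock milliseconds per run, one value per line.
time_runs() {
  local n=$1; shift
  local i start end
  "$@" > /dev/null                 # warm-up (page cache, on-disk caches such as the CUDA JIT cache)
  for ((i = 0; i < n; i++)); do
    start=$(date +%s%N)            # nanoseconds since the epoch (GNU date)
    "$@" > /dev/null
    end=$(date +%s%N)
    # Exact integer subtraction in bash, then ns -> ms with three decimals.
    awk -v ns="$((end - start))" 'BEGIN { printf "%.3f\n", ns / 1e6 }'
  done
}
```

For example, `time_runs 20 ./e2e_bench > /tmp/times.txt` would record 20 per-run timings, one per line. This assumes an `e2e_bench` binary like the one mentioned in step 2.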

Run enough iterations to get stable results: aim for a coefficient of variation (standard deviation divided by mean) below 5%.
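The coefficient of variation is the sample standard deviation divided by the mean, expressed as a percentage. Below is a minimal shell sketch that reads one timing per line (for example, per-run milliseconds) and checks the result against the 5% target. The helper names `cv_percent` and `is_stable` are hypothetical.

```shell
# cv_percent: read one timing per line on stdin, print CV = 100 * stddev / mean.
# Uses the sample standard deviation (divide by n - 1), so it needs >= 2 samples.
cv_percent() {
  awk '
    { x[NR] = $1; sum += $1 }
    END {
      if (NR < 2) { print "cv_percent: need at least 2 samples" > "/dev/stderr"; exit 1 }
      mean = sum / NR
      for (i = 1; i <= NR; i++) ss += (x[i] - mean) ^ 2
      printf "%.2f\n", 100 * sqrt(ss / (NR - 1)) / mean
    }'
}

# is_stable CV_PERCENT [THRESHOLD]: exit status 0 if CV < THRESHOLD (default 5).
is_stable() {
  awk -v cv="$1" -v t="${2:-5}" 'BEGIN { exit !(cv + 0 < t + 0) }'
}

printf '100\n102\n98\n101\n99\n' | cv_percent   # -> 1.58
```

If the CV stays above the threshold, more iterations or a longer workload per iteration usually help before any optimization is attempted.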

```bash
# Save baseline (JSON or structured output preferred for diff later)
./benchmark --iterations 5 --json-output /tmp/bench-baseline.json
```

Source: trevors/dot-claude

First seen: Feb 27, 2026