cpu-kernels
CPU C++ Kernels for x86 Processors
This skill provides patterns and guidance for developing optimized C++ kernels targeting x86 CPUs (Intel Xeon and compatible processors) with AVX2 and AVX512 intrinsics. Kernels are compiled via kernel-builder and distributed through the Hugging Face kernels ecosystem.
Who runs these commands? You, the agent, not a human. This is an autonomous loop: you write/edit the C++ kernel, build it, then run the scripts below as tools (via Bash) to check correctness, benchmark, and profile. You read each result, record it with `trial_manager.py`, decide the next change from the Phase 2 decision tree, and repeat until you hit `early_stop_speedup` or run all `max_trials`.
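
The stopping rule of this loop can be sketched in a few lines of Python. This is only an illustration of the control flow, not the code the skill's scripts run: `run_trial_loop` and `run_trial` are hypothetical names, and `run_trial` stands in for one full iteration (edit, build, correctness check, benchmark) that returns the measured speedup over the baseline.

```python
from typing import Callable, List, Tuple


def run_trial_loop(
    run_trial: Callable[[int], float],
    early_stop_speedup: float,
    max_trials: int,
) -> Tuple[float, List[float]]:
    """Run trials until one reaches early_stop_speedup or max_trials are used.

    run_trial is a hypothetical stand-in for one iteration of the loop;
    it takes the 1-based trial number and returns the measured speedup.
    """
    history: List[float] = []
    best = 0.0
    for trial in range(1, max_trials + 1):
        speedup = run_trial(trial)
        history.append(speedup)  # the real loop records this with trial_manager.py
        best = max(best, speedup)
        if speedup >= early_stop_speedup:
            break  # target reached, remaining trials are skipped
    return best, history


if __name__ == "__main__":
    # Made-up speedups, used only to exercise the stopping rule.
    fake = {1: 0.9, 2: 1.4, 3: 2.1, 4: 2.5}
    best, history = run_trial_loop(lambda t: fake[t], early_stop_speedup=2.0, max_trials=4)
    print(best, history)
```

With the made-up numbers above, the loop stops after the third trial because 2.1 meets the 2.0 target; with a target that is never met, it would use all four trials.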
Key Concepts (read before the Quick Start)
The commands use a few names that mean different things. They are not interchangeable:
| Name (example) | What it is | Used by |
|---|---|---|
| `baseline.py` | The PyTorch reference implementation you optimize against. It is the ground truth for correctness and the speed reference for speedup. It must define `get_inputs()` and either `get_reference_output()` or a `Model` class (plus optional `get_init_inputs()`). You write this file (or it is given) before starting. | every script |
| `my_rmsnorm` | A trial-tree label: an arbitrary name you pick for this optimization task. `trial_manager.py` stores all attempts under `trials/my_rmsnorm/`. It is only a tracking ID. | `trial_manager.py` only |
| `my_kernel` | The installed Python package name: the build artifact produced by `kernel-builder build` + `pip install`. This is the importable module that contains your compiled kernel. | `--kernel-package` |
| `my_kernel.rms_norm` | A `<package>.<function>` path: the actual callable inside the installed package. Passed to `--op` to tell the benchmark/profiler which function to run. | `--op` |
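
To make the `baseline.py` contract concrete, here is a minimal sketch for an RMSNorm task that uses the `Model` class variant. The shapes, the `eps` default, and the choice to return plain Python lists from `get_inputs()` and `get_init_inputs()` are assumptions made for illustration; the scripts' exact expectations are not spelled out here, so check them before relying on this layout.

```python
# baseline.py (sketch): PyTorch reference for an RMSNorm task.
# Shapes, eps, and list-style return values are illustrative assumptions.
import torch


class Model(torch.nn.Module):
    """Reference RMSNorm: x / sqrt(mean(x^2) + eps) * weight."""

    def __init__(self, hidden_size: int = 4096, eps: float = 1e-6):
        super().__init__()
        self.weight = torch.nn.Parameter(torch.ones(hidden_size))
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Mean of squares over the last (hidden) dimension.
        variance = x.pow(2).mean(dim=-1, keepdim=True)
        return x * torch.rsqrt(variance + self.eps) * self.weight


def get_init_inputs():
    # Constructor arguments for Model (optional hook).
    return [4096, 1e-6]


def get_inputs():
    # Fixed seed so the reference and the kernel see the same data.
    torch.manual_seed(0)
    return [torch.randn(8, 4096)]
```

Under these assumptions, a script can build the reference with `Model(*get_init_inputs())` and call it on `get_inputs()` to obtain the output that the compiled kernel is compared against.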
⚠️ `--op` means two different things depending on the script. In `analyze_op.py`, `--op` is a plain operation name (e.g. `"rms_norm"`) used to look up compute/memory characteristics. In `benchmark_cpu.py` and `cpu_profiler.py`, `--op` is a `package.function` path (e.g. `my_kernel.rms_norm`) used to import and call your kernel. Same flag, different meaning: read each command below carefully.
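
The `package.function` form can be pictured as an import followed by an attribute lookup. The sketch below shows one plausible way such a path could be resolved; `resolve_op` is a hypothetical helper, not the actual code in `benchmark_cpu.py` or `cpu_profiler.py`, and it uses the standard-library path `math.sqrt` as a stand-in because `my_kernel` only exists after you build and install it.

```python
import importlib
from typing import Callable


def resolve_op(op_path: str) -> Callable:
    """Resolve a '<package>.<function>' string to the callable it names.

    Hypothetical helper: splits on the last dot, imports the left part as
    a module, and looks up the right part as an attribute of that module.
    """
    package_name, sep, function_name = op_path.rpartition(".")
    if not sep or not package_name or not function_name:
        raise ValueError(f"expected '<package>.<function>', got {op_path!r}")
    module = importlib.import_module(package_name)
    return getattr(module, function_name)


if __name__ == "__main__":
    # Stand-in for "my_kernel.rms_norm"; any importable module works the same way.
    sqrt = resolve_op("math.sqrt")
    print(sqrt(9.0))

    # A plain operation name (the analyze_op.py style) is not a valid path here.
    try:
        resolve_op("rms_norm")
    except ValueError as err:
        print(err)
```

A plain name such as `rms_norm` has no package part, which is why passing the `analyze_op.py` style of value to a script that expects a path cannot work.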