cuda-kernels
CUDA Kernels for Diffusers & Transformers
This skill provides patterns and guidance for developing optimized CUDA kernels targeting NVIDIA GPUs (H100, A100, T4) for use with the Hugging Face diffusers and transformers libraries.
Hard Constraints — Read Before Writing Any Code
Kernels MUST build with kernel-builder and meet the Kernel Hub requirements. kernel-builder compiles against the Python limited API (ABI3), so a single binary works across Python versions from 3.9 onward. Several patterns that are standard in generic PyTorch-extension tutorials are therefore hard build failures here. Do not use them, even if the PyTorch documentation or your training data suggests them.