cuda-kernels
CUDA Kernel Development & Optimization
Skill for developing and optimizing custom CUDA kernels in the candle framework for qwen3-tts-rs.
Trigger Words
cuda kernel, custom kernel, fused op, write kernel, ptx, kernel launch, CustomOp, nsys, ncu, profiling, roofline, occupancy, register pressure
Candle Custom Op Patterns
CustomOp1 (single input tensor)
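// Candle types used when implementing a custom op with a CUDA path.
use candle_core::{CustomOp1, Layout, Shape, DType, backend::BackendStorage, CudaStorage};

// Placeholder op; its fields hold whatever parameters the kernel needs.
struct MyFusedOp { /* params */ }

impl CustomOp1 for MyFusedOp {
    // Name that identifies this op, for example in candle's error messages.
    fn name(&self) -> &'static str { "my_fused_op" }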