# Tutorial: Adding a New Kernel to sgl-kernel (AOT / Heavyweight)
This tutorial walks through adding a simple element-wise scale operation as an AOT kernel. We'll implement `scale(x, factor) = x * factor` to demonstrate the complete workflow.
## Goal
Add a new operation that scales each element of a tensor by a scalar factor:

- Input: a CUDA tensor `x` and a scalar `factor` (float)
- Output: `x * factor` (element-wise, in-place or into a pre-allocated `out`)
- Supported dtypes: FP16 (`torch.float16`), BF16 (`torch.bfloat16`), FP32 (`torch.float32`)
- Dispatched via the `DISPATCH_PYTORCH_DTYPE_TO_CTYPE_FLOAT_FP16` macro (defined in `sgl-kernel/include/utils.h`)
## Two rules of thumb (must follow)
- Prefer `python/sglang/jit_kernel` first when the kernel does not depend on CUTLASS or another large C++ project. This is the default path for lightweight kernels that benefit from rapid iteration.
- Prefer `sgl-kernel` when the kernel does depend on CUTLASS or another large C++ project, or when it should be part of the AOT wheel / torch op registration flow (see the registration sketch after this list).
- Exception: if the dependency is `flashinfer`, or CUTLASS that is already provided through `flashinfer`, the kernel can still be implemented as a `jit_kernel`.
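To illustrate what being "part of the torch op registration flow" means for the `sgl-kernel` path, here is a minimal registration sketch. The library namespace `sgl_kernel`, the schema string, and the reference to the `scale` launcher sketched above are all assumptions; the project's real binding files may use different helpers.

```cuda
// Hypothetical torch op registration for the AOT (sgl-kernel) path.
// Namespace, schema, and the referenced `scale` launcher are assumptions.
#include <torch/library.h>

// Forward declaration of the CUDA launcher sketched earlier.
void scale(torch::Tensor x, torch::Tensor out, double factor);

// Declare the op schema once; `Tensor(a!)` marks `out` as mutated in place.
TORCH_LIBRARY_FRAGMENT(sgl_kernel, m) {
  m.def("scale(Tensor x, Tensor(a!) out, float factor) -> ()");
}

// Bind the CUDA implementation to the schema.
TORCH_LIBRARY_IMPL(sgl_kernel, CUDA, m) {
  m.impl("scale", &scale);
}
```

Once registered and built into the AOT wheel, an op declared this way would be callable from Python as `torch.ops.sgl_kernel.scale(x, out, factor)` (the name follows from the assumed registration above).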