add-sgl-kernel


Tutorial: Adding a New Kernel to sgl-kernel (AOT / Heavyweight)

This tutorial walks through adding a simple element-wise scale operation as an AOT kernel. We'll implement scale(x, factor) = x * factor to demonstrate the complete workflow.

Goal

Add a new operation that scales each element of a tensor by a scalar factor (a kernel sketch follows this list):

  • Input: tensor x (CUDA) and scalar factor (float)
  • Output: x * factor (element-wise, either in-place or into a pre-allocated out)
  • Supported dtypes: FP16 (torch.float16), BF16 (torch.bfloat16), FP32 (torch.float32)
    • Dispatched via the DISPATCH_PYTORCH_DTYPE_TO_CTYPE_FLOAT_FP16 macro (defined in sgl-kernel/include/utils.h)
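Below is a minimal sketch of what such a kernel and its host-side launcher can look like. The file name, the launcher signature, and the use of PyTorch's stock AT_DISPATCH_FLOATING_TYPES_AND2 macro (standing in for the repo's DISPATCH_PYTORCH_DTYPE_TO_CTYPE_FLOAT_FP16, whose exact signature is not reproduced here) are assumptions made for a self-contained example, not the actual sgl-kernel sources.

```cuda
// scale.cu -- illustrative sketch, not the actual sgl-kernel implementation.
#include <ATen/ATen.h>
#include <ATen/Dispatch.h>
#include <ATen/cuda/CUDAContext.h>
#include <c10/cuda/CUDAGuard.h>

template <typename scalar_t>
__global__ void scale_kernel(const scalar_t* __restrict__ in,
                             scalar_t* __restrict__ out,
                             float factor, int64_t n) {
  int64_t i = blockIdx.x * static_cast<int64_t>(blockDim.x) + threadIdx.x;
  if (i < n) {
    // Multiply in fp32 and cast back so FP16/BF16 inputs keep full
    // precision in the intermediate product.
    out[i] = static_cast<scalar_t>(static_cast<float>(in[i]) * factor);
  }
}

// Host-side launcher. Writing into a pre-allocated `out` also covers the
// in-place case: pass x itself as out.
void scale(at::Tensor x, at::Tensor out, double factor) {
  TORCH_CHECK(x.is_cuda() && out.is_cuda(), "scale: tensors must be CUDA");
  TORCH_CHECK(x.is_contiguous() && out.is_contiguous(),
              "scale: tensors must be contiguous");
  TORCH_CHECK(x.sizes() == out.sizes() && x.scalar_type() == out.scalar_type(),
              "scale: x and out must match in shape and dtype");
  const at::cuda::OptionalCUDAGuard device_guard(x.device());
  const int64_t n = x.numel();
  if (n == 0) return;
  const int threads = 256;
  const int blocks = static_cast<int>((n + threads - 1) / threads);
  cudaStream_t stream = at::cuda::getCurrentCUDAStream();
  // Stock dispatch macro (also instantiates double; the tutorial only
  // targets FP16/BF16/FP32).
  AT_DISPATCH_FLOATING_TYPES_AND2(
      at::ScalarType::Half, at::ScalarType::BFloat16, x.scalar_type(),
      "scale", [&] {
        scale_kernel<scalar_t><<<blocks, threads, 0, stream>>>(
            x.data_ptr<scalar_t>(), out.data_ptr<scalar_t>(),
            static_cast<float>(factor), n);
      });
}
```

Taking a pre-allocated out keeps allocation policy with the caller; passing x itself as out gives the in-place variant described in the goal.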

Two rules of thumb (must follow)

  1. Prefer python/sglang/jit_kernel when the kernel does not depend on CUTLASS or another large C++ project. This is the default path for lightweight kernels that benefit from rapid iteration.
  2. Prefer sgl-kernel when the kernel does depend on CUTLASS or another large C++ project, or when it should ship in the AOT wheel and go through the torch op registration flow (a registration sketch follows this list).

  Exception: if the dependency is flashinfer, or CUTLASS functionality that is already provided through flashinfer, the kernel can still be implemented as a jit_kernel.
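As a reference point for the registration flow mentioned in rule 2, here is a hedged sketch using PyTorch's standard TORCH_LIBRARY macros. The sgl_kernel namespace and the op schema are illustrative assumptions wired to the scale launcher sketched above; the real sgl-kernel bindings may differ.

```cuda
// register_scale.cu -- illustrative registration sketch; sgl-kernel's real
// bindings live in its own extension sources.
#include <ATen/core/Tensor.h>
#include <torch/library.h>

// Declared in scale.cu above (assumed name and signature).
void scale(at::Tensor x, at::Tensor out, double factor);

TORCH_LIBRARY_FRAGMENT(sgl_kernel, m) {
  // `Tensor(a!) out` marks `out` as mutated, so the dispatcher treats the
  // op as an out-variant.
  m.def("scale(Tensor x, Tensor(a!) out, float factor) -> ()");
}

TORCH_LIBRARY_IMPL(sgl_kernel, CUDA, m) {
  m.impl("scale", scale);
}
```

Once the extension is built into the AOT wheel, the op would be reachable from Python as torch.ops.sgl_kernel.scale(x, out, factor) under these assumed names.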