add-sgl-kernel


Tutorial: Adding a New Kernel to sgl-kernel (AOT / Heavyweight)

This tutorial walks through adding a simple element-wise scale operation as an AOT kernel. We'll implement scale(x, factor) = x * factor to demonstrate the complete workflow.

Goal

Add a new operation that scales each element of a tensor by a scalar factor (a kernel sketch follows this list):

  • Input: tensor x (CUDA) and scalar factor (float)
  • Output: x * factor (element-wise, either in-place or into a pre-allocated out)
  • Supported dtypes: FP16 (torch.float16), BF16 (torch.bfloat16), FP32 (torch.float32)
    • Dispatched via the DISPATCH_PYTORCH_DTYPE_TO_CTYPE_FLOAT_FP16 macro (defined in sgl-kernel/include/utils.h)
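Below is a minimal sketch of what such a kernel and its host-side launcher can look like. The file name, the launcher signature, and the use of PyTorch's stock AT_DISPATCH_FLOATING_TYPES_AND2 macro (standing in for the repo's DISPATCH_PYTORCH_DTYPE_TO_CTYPE_FLOAT_FP16, whose exact signature is not reproduced here) are assumptions made for a self-contained example, not the actual sgl-kernel sources.

```cuda
// scale.cu -- illustrative sketch, not the actual sgl-kernel implementation.
#include <ATen/ATen.h>
#include <ATen/Dispatch.h>
#include <ATen/cuda/CUDAContext.h>
#include <c10/cuda/CUDAGuard.h>

template <typename scalar_t>
__global__ void scale_kernel(const scalar_t* __restrict__ in,
                             scalar_t* __restrict__ out,
                             float factor, int64_t n) {
  int64_t i = blockIdx.x * static_cast<int64_t>(blockDim.x) + threadIdx.x;
  if (i < n) {
    // Multiply in fp32 and cast back so FP16/BF16 inputs keep full
    // precision in the intermediate product.
    out[i] = static_cast<scalar_t>(static_cast<float>(in[i]) * factor);
  }
}

// Host-side launcher. Writing into a pre-allocated `out` also covers the
// in-place case: pass x itself as out.
void scale(at::Tensor x, at::Tensor out, double factor) {
  TORCH_CHECK(x.is_cuda() && out.is_cuda(), "scale: tensors must be CUDA");
  TORCH_CHECK(x.is_contiguous() && out.is_contiguous(),
              "scale: tensors must be contiguous");
  TORCH_CHECK(x.sizes() == out.sizes() && x.scalar_type() == out.scalar_type(),
              "scale: x and out must match in shape and dtype");
  const at::cuda::OptionalCUDAGuard device_guard(x.device());
  const int64_t n = x.numel();
  if (n == 0) return;
  const int threads = 256;
  const int blocks = static_cast<int>((n + threads - 1) / threads);
  cudaStream_t stream = at::cuda::getCurrentCUDAStream();
  // Stock dispatch macro (also instantiates double; the tutorial only
  // targets FP16/BF16/FP32).
  AT_DISPATCH_FLOATING_TYPES_AND2(
      at::ScalarType::Half, at::ScalarType::BFloat16, x.scalar_type(),
      "scale", [&] {
        scale_kernel<scalar_t><<<blocks, threads, 0, stream>>>(
            x.data_ptr<scalar_t>(), out.data_ptr<scalar_t>(),
            static_cast<float>(factor), n);
      });
}
```

Taking a pre-allocated out keeps allocation policy with the caller; passing x itself as out gives the in-place variant described in the goal.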

Two rules of thumb (must follow)

  1. Prefer python/sglang/jit_kernel when the kernel does not depend on CUTLASS or another large C++ project. This is the default path for lightweight kernels that benefit from rapid iteration.
  2. Prefer sgl-kernel when the kernel does depend on CUTLASS or another large C++ project, or when it should ship in the AOT wheel and go through the torch op registration flow (a registration sketch follows this list).

  Exception: if the dependency is flashinfer, or CUTLASS functionality that is already provided through flashinfer, the kernel can still be implemented as a jit_kernel.
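As a reference point for the registration flow mentioned in rule 2, here is a hedged sketch using PyTorch's standard TORCH_LIBRARY macros. The sgl_kernel namespace and the op schema are illustrative assumptions wired to the scale launcher sketched above; the real sgl-kernel bindings may differ.

```cuda
// register_scale.cu -- illustrative registration sketch; sgl-kernel's real
// bindings live in its own extension sources.
#include <ATen/core/Tensor.h>
#include <torch/library.h>

// Declared in scale.cu above (assumed name and signature).
void scale(at::Tensor x, at::Tensor out, double factor);

TORCH_LIBRARY_FRAGMENT(sgl_kernel, m) {
  // `Tensor(a!) out` marks `out` as mutated, so the dispatcher treats the
  // op as an out-variant.
  m.def("scale(Tensor x, Tensor(a!) out, float factor) -> ()");
}

TORCH_LIBRARY_IMPL(sgl_kernel, CUDA, m) {
  m.impl("scale", scale);
}
```

Once the extension is built into the AOT wheel, the op would be reachable from Python as torch.ops.sgl_kernel.scale(x, out, factor) under these assumed names.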