CuTe Python DSL Reference
Execution Model
CuTe Python DSL is the Python surface for NVIDIA's CuTe layout algebra. Unlike cuTile's block-level abstraction, CuTe DSL exposes explicit thread/warp/warpgroup control, TMA pipelines, barrier choreography, and shared memory management.
Two-Level Host/Device Pattern
Every CuTe DSL kernel has two functions:
- @cute.jit host function: runs on CPU, sets up TMA descriptors, computes grid, allocates shared memory, launches the kernel
- @cute.kernel device function: runs on GPU, contains the actual computation
import cutlass
import cutlass.cute as cute
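One of the host-side steps listed above, computing the launch grid, can be sketched in plain Python without touching the CuTe DSL API. This is only an illustration under stated assumptions: the helper names `ceil_div` and `compute_grid`, the one-CTA-per-output-tile mapping, and the tile sizes are hypothetical and do not come from the library.

```python
def ceil_div(a: int, b: int) -> int:
    """Round a / b up to the next integer (b must be positive)."""
    return -(-a // b)


def compute_grid(m: int, n: int, tile_m: int, tile_n: int) -> tuple:
    """Hypothetical helper: one thread block (CTA) per output tile.

    Assumes an m x n output split into tile_m x tile_n tiles; partial
    tiles at the edges still need a CTA, hence the round-up.
    """
    return (ceil_div(m, tile_m), ceil_div(n, tile_n), 1)


# Example: a 1000 x 512 output with 128 x 64 tiles.
grid = compute_grid(1000, 512, 128, 64)
print(grid)  # (8, 8, 1)
```

In a host function, a tuple like this would typically be what is passed as the grid when launching the device function. The round-up means edge tiles are covered, so the device code has to guard against out-of-bounds accesses.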