cutlass-skill
CUTLASS & CuTeDSL Development
Source Code Locations
The CUTLASS source code lives under repos/cutlass/ inside this skill's installation directory.
The actual path depends on the tool in use:
- Cursor: ~/.cursor/skills/cutlass-skill/repos/cutlass/
- Claude Code: ~/.claude/skills/cutlass-skill/repos/cutlass/
- Codex: ~/.codex/skills/cutlass-skill/repos/cutlass/
CUTLASS_REPO: the examples below use ~/.cursor/skills/cutlass-skill/repos/cutlass/ as a placeholder; substitute the actual path for your tool.
If that path does not exist, run bash update-repos.sh cutlass in the project directory.
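The lookup can also be scripted. Below is a minimal sketch (the helper name resolve_cutlass_repo is hypothetical, not part of the skill) that probes the three install locations listed above and points back to update-repos.sh on failure:

```python
from pathlib import Path

# Candidate install locations, in the order listed above.
CANDIDATES = [
    Path.home() / ".cursor" / "skills" / "cutlass-skill" / "repos" / "cutlass",
    Path.home() / ".claude" / "skills" / "cutlass-skill" / "repos" / "cutlass",
    Path.home() / ".codex" / "skills" / "cutlass-skill" / "repos" / "cutlass",
]

def resolve_cutlass_repo() -> Path:
    """Return the first existing CUTLASS checkout among the known locations."""
    for candidate in CANDIDATES:
        if candidate.is_dir():
            return candidate
    raise FileNotFoundError(
        "CUTLASS_REPO not found; run `bash update-repos.sh cutlass` "
        "in the project directory"
    )
```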
CuTeDSL (Python DSL for GPU Kernels)
CUTLASS_REPO/python/CuTeDSL/
├── cutlass/
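For a first taste of the DSL, here is a minimal sketch of a CuTeDSL entry point. It assumes the @cute.jit decorator, cute.make_layout, and cute.printf behave as in the examples shipped with the repo; verify against your checkout, since the API is still evolving:

```python
import cutlass.cute as cute

@cute.jit
def inspect_layout():
    # Build a small (8, 4) CuTe layout; shape chosen only for illustration.
    layout = cute.make_layout((8, 4))
    # printf is evaluated by the DSL, not by Python's print.
    cute.printf("layout: {}", layout)

if __name__ == "__main__":
    # Calling a @cute.jit function JIT-compiles and runs it.
    inspect_layout()
```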