cuda-skill
CUDA & PTX Reference
Documentation Locations
All documentation is under the references/ directory within this skill's install location.
The base path depends on which agent tool is used:
- Cursor: ~/.cursor/skills/cuda-skill/references/
- Claude Code: ~/.claude/skills/cuda-skill/references/
- Codex: ~/.codex/skills/cuda-skill/references/
Below, CUDA_REFS refers to the references/ directory inside the skill's install path (e.g. ~/.cursor/skills/cuda-skill/references/ for Cursor, ~/.claude/skills/cuda-skill/references/ for Claude Code). Replace CUDA_REFS with the actual path in all search commands.
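For instance, a lookup for a PTX instruction could look like the following (a minimal sketch assuming a POSIX shell with grep available; the base path shown is the Claude Code location, so substitute your own tool's path):

```sh
# Minimal sketch: list which PTX ISA spec files mention "ldmatrix".
# The base path is the Claude Code install location; adjust for your tool.
grep -ril "ldmatrix" ~/.claude/skills/cuda-skill/references/ptx-docs/
```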
references/
├── ptx-docs/ # PTX ISA 9.1 full spec (405 files, 2.3MB)
├── ptx-simple/ # PTX condensed quick-ref (13 files, 149KB)
├── cuda-runtime-docs/ # CUDA Runtime API 13.1 (107 files, 0.9MB)
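Since ptx-simple/ is far smaller than the full spec, one workable pattern (an illustrative sketch, not a behavior the skill prescribes) is to search the condensed quick-ref first and fall back to the full spec only on a miss:

```sh
# Hypothetical two-step lookup: condensed quick-ref first, full spec as fallback.
CUDA_REFS=~/.claude/skills/cuda-skill/references   # adjust for your agent tool
grep -ril "cp.async" "$CUDA_REFS/ptx-simple/" \
  || grep -ril "cp.async" "$CUDA_REFS/ptx-docs/"
```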