cuda-attention-kernel-patterns
ONNX Domain Attention (Opset 23/24) CUDA Patterns
Reusable knowledge from ONNX Attention CUDA development in ORT.
Scope: This skill covers the ONNX domain `Attention` operator (opset 23/24) implemented at `core/providers/cuda/llm/attention.cc`. This is separate from the contrib domain `MultiHeadAttention`/`GroupQueryAttention` at `contrib_ops/cuda/bert/`. The two share some underlying kernels (CUTLASS FMHA, Flash Attention) and infrastructure (`attention_softmax.h`), but have different dispatch logic, parameter structs, and eligibility checks.
- Shared infrastructure: CUTLASS FMHA kernel, Flash kernel, unified unfused kernel (`unfused_attention.cu`), `attention_softmax.h`, `attention_impl.cu` (contrib only)
- ONNX-specific: Dispatch cascade in `attention.cc`, `ConvertAttnMaskToBias`, `mask_filter_value` cap, parameter bridge to contrib structs, `attention_mask_impl.cu`
- Contrib-specific: Own dispatch in contrib MHA/GQA ops, uses `contrib::AttentionParameters` directly, has XQA kernel, past-present buffer sharing
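
As a rough illustration of the mask-to-bias idea behind `ConvertAttnMaskToBias`, here is a minimal CPU reference sketch. It is not the ORT implementation: the function name `MaskToBiasReference` and its signature are hypothetical, it assumes the ONNX convention that a `true` mask entry means "attend", and it does not model the `mask_filter_value` cap, whose exact rule is not stated here.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical CPU reference, NOT the ORT kernel.
// Converts a boolean attention mask into an additive bias:
//   mask[i] == true  -> 0.0f              (position takes part in attention)
//   mask[i] == false -> mask_filter_value (large negative, suppressed by softmax)
std::vector<float> MaskToBiasReference(const std::vector<bool>& mask,
                                       float mask_filter_value) {
  std::vector<float> bias(mask.size());
  for (std::size_t i = 0; i < mask.size(); ++i) {
    bias[i] = mask[i] ? 0.0f : mask_filter_value;
  }
  return bias;
}
```

The resulting bias is added to the attention scores before softmax, so a kernel that only accepts an additive bias can still honor a boolean mask.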