cuda-attention-kernel-patterns

ONNX Domain Attention (Opset 23/24) CUDA Patterns

Reusable knowledge from developing the ONNX Attention CUDA implementation in ONNX Runtime (ORT).

Scope: This skill covers the ONNX domain Attention operator (opset 23/24) implemented at core/providers/cuda/llm/attention.cc. This is separate from the contrib domain MultiHeadAttention / GroupQueryAttention at contrib_ops/cuda/bert/. They share some underlying kernels (CUTLASS FMHA, Flash Attention) and infrastructure (attention_softmax.h) but have different dispatch logic, parameter structs, and eligibility checks.

  • Shared infrastructure: CUTLASS FMHA kernel, Flash kernel, unified unfused kernel (unfused_attention.cu), attention_softmax.h, attention_impl.cu (contrib only)
  • ONNX-specific: Dispatch cascade in attention.cc, ConvertAttnMaskToBias, mask_filter_value cap, parameter bridge to contrib structs, attention_mask_impl.cu
  • Contrib-specific: Own dispatch in the contrib MHA/GQA ops, direct use of contrib::AttentionParameters, XQA kernel, past-present buffer sharing
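
To make the ONNX-specific mask handling concrete, here is a minimal host-side sketch of turning a boolean attention mask into an additive bias with a capped fill value. It is an illustration only, not the code in attention.cc or attention_mask_impl.cu: the function name MaskToBiasSketch, the "true means attend" convention, and the choice of the lowest finite float as the cap are all assumptions made for this example.

```cpp
#include <algorithm>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical reference helper (not an ORT API). The real conversion is
// assumed to run on the GPU; this version only shows the arithmetic.
// Assumption: mask[i] == true means "attend", false means "masked out".
std::vector<float> MaskToBiasSketch(const std::vector<bool>& mask,
                                    float mask_filter_value) {
  // Assumed cap: clamp the fill value to the lowest finite float so that a
  // fully masked row never feeds -inf into softmax (which would yield NaN).
  const float capped =
      std::max(mask_filter_value, std::numeric_limits<float>::lowest());

  std::vector<float> bias(mask.size());
  for (std::size_t i = 0; i < mask.size(); ++i) {
    // Attended positions add nothing to the logits; masked ones add the
    // large negative fill value.
    bias[i] = mask[i] ? 0.0f : capped;
  }
  return bias;
}
```

In this sketch, attended positions get a bias of 0 and masked positions get the capped fill value, so the downstream kernel only needs to add the bias to the attention logits before softmax.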

1. Runner Dispatch Cascade

First Seen
May 15, 2026
cuda-attention-kernel-patterns — microsoft/onnxruntime