optimizing-attention-flash
Installation
SKILL.md
Flash Attention - Fast Memory-Efficient Attention
Quick start
Flash Attention provides 2-4x speedup and 10-20x memory reduction for transformer attention through IO-aware tiling and recomputation.
PyTorch native (easiest, PyTorch 2.2+):
import torch
import torch.nn.functional as F
q = torch.randn(2, 8, 512, 64, device='cuda', dtype=torch.float16) # [batch, heads, seq, dim]
k = torch.randn(2, 8, 512, 64, device='cuda', dtype=torch.float16)
v = torch.randn(2, 8, 512, 64, device='cuda', dtype=torch.float16)
# Automatically uses Flash Attention if available
out = F.scaled_dot_product_attention(q, k, v)
Related skills
More from nousresearch/hermes-agent
dogfood
Exploratory QA of web apps: find bugs, evidence, reports.
2.4Kyuanbao
Yuanbao (元宝) groups: @mention users, query info/members.
161llm-wiki
Karpathy's LLM Wiki: build/query interlinked markdown KB.
20manim-video
Manim CE animations: 3Blue1Brown math/algo videos.
15powerpoint
Create, read, edit .pptx decks, slides, notes, templates.
14ocr-and-documents
Extract text from PDFs/scans (pymupdf, marker-pdf).
14