mla-analysis
MLA Cost Analysis & Regime Guide
Regime Selection
| Regime | s Range | Best Kernel | Why |
|---|---|---|---|
| Decode | s=1 | FlashMLA | 16x latency reduction vs FlashAttention (compressed KV) |
| Speculative | s=2-32 | MLAvar6+ or FlashMLA | MLAvar6+ is expected to outperform both FlashMLA and FlashAttention in this range |
| Prefill | s>128 | FlashAttention | Avoids 4x FLOP penalty of latent-space compute |
Crossover point: FlashAttention becomes faster than FlashMLA at approximately s=16-32 for DeepSeek-V3 parameters.
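The table above can be read as a simple lookup on the query length s. The sketch below is only an illustration of that reading: the function name `select_regime` is hypothetical, and because the table does not cover s=33-128, that range is returned as "Unspecified" rather than guessed.

```python
def select_regime(s: int) -> tuple[str, str]:
    """Map query length s to (regime, suggested kernel) per the table above.

    Hypothetical helper: thresholds are copied from the table, and the
    uncovered range s=33..128 is reported as unspecified, not inferred.
    """
    if s < 1:
        raise ValueError("s must be a positive integer")
    if s == 1:
        return ("Decode", "FlashMLA")
    if s <= 32:
        return ("Speculative", "MLAvar6+ or FlashMLA")
    if s > 128:
        return ("Prefill", "FlashAttention")
    # Assumption: the table gives no entry for 33 <= s <= 128.
    return ("Unspecified", "not covered by the table")


if __name__ == "__main__":
    for s in (1, 8, 64, 1024):
        print(s, select_regime(s))
```

Since the crossover is only approximate (s=16-32), measuring both kernels near that boundary is likely more reliable than a fixed threshold.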
Cost Models (DeepSeek-V3-like: h=128, d=128, k=512, p=64)
FlashAttention
- FLOPs: `2bhst(2d + p) = 2bhst * 320`
- Bytes: `w * bh(s+t)(2d + p) = w * bh(s+t) * 320`
- At s=1: AI ≈ 1 FLOP/byte (deeply memory-bound)
- At s=1024: AI ≈ 819 FLOP/byte (deeply compute-bound)
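The FlashAttention formulas above can be checked numerically with a small sketch. Two inputs are assumptions, not values stated in this guide: `w=2` bytes per element (e.g. BF16) and a context length `t=4096`. They were chosen because they reproduce the quoted intensities. The function name and the reading of `b` as batch size, `s` as query length, and `t` as KV length are likewise illustrative.

```python
def flash_attention_cost(b, h, s, t, d=128, p=64, w=2):
    """Return (flops, bytes, arithmetic intensity) for the model above.

    Assumptions (not from the guide): w=2 bytes per element,
    b = batch size, s = query length, t = KV/context length.
    """
    width = 2 * d + p                      # 2d + p = 320 for d=128, p=64
    flops = 2 * b * h * s * t * width      # 2bhst(2d + p)
    bytes_moved = w * b * h * (s + t) * width  # w * bh(s+t)(2d + p)
    return flops, bytes_moved, flops / bytes_moved


if __name__ == "__main__":
    # t=4096 is an assumed context length, picked to match the quoted AI values.
    for s in (1, 1024):
        _, _, ai = flash_attention_cost(b=1, h=128, s=s, t=4096)
        print(f"s={s}: AI = {ai:.1f} FLOP/byte")
```

The `(2d + p)` factor and `bh` cancel in the ratio, so under this model the intensity reduces to `2st / (w(s+t))` and depends only on `s`, `t`, and `w`.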