MLA Cost Analysis & Regime Guide

Regime Selection

| Regime      | s Range | Best Kernel          | Why |
|-------------|---------|----------------------|-----|
| Decode      | s=1     | FlashMLA             | 16x latency reduction vs FlashAttention (compressed KV) |
| Speculative | s=2-32  | MLAvar6+ or FlashMLA | MLAvar6+ is expected to outperform both FlashMLA and FlashAttention in this band |
| Prefill     | s>128   | FlashAttention       | Avoids the 4x FLOP penalty of latent-space compute |

Crossover point: FlashAttention becomes faster than FlashMLA at approximately s=16-32 for DeepSeek-V3 parameters.
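The regime table can be sketched as a small dispatch function. This is illustrative only: the kernel names come from the table above, but the exact cutoffs (the 2-32 speculative band and the s=32 crossover) are assumptions, not a definitive dispatch rule.

```python
def select_kernel(s: int) -> str:
    """Pick an attention kernel from the query length s.

    Thresholds follow the regime table; the s=32 boundary is the
    approximate FlashMLA/FlashAttention crossover noted above.
    """
    if s == 1:                  # pure decode: compressed KV wins on bandwidth
        return "FlashMLA"
    if s <= 32:                 # speculative decoding: small query blocks
        return "MLAvar6+"
    return "FlashAttention"     # prefill: avoid the 4x latent-FLOP penalty
```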

Cost Models (DeepSeek-V3-like: h=128, d=128, k=512, p=64)

FlashAttention

  • FLOPs: 2bhst(2d + p) = 2bhst * 320
  • Bytes: w * bh(s+t)(2d + p) = w * bh(s+t) * 320, where w is bytes per element (w=2 for bf16)
  • At s=1: AI ≈ 1 FLOP/byte (deeply memory-bound)
  • At s=1024: AI ≈ 819 FLOP/byte (deeply compute-bound)
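A minimal sketch of the cost model above, assuming bf16 (w=2) and a context length of t=4096, which reproduces the quoted intensities (b, h, and t are illustrative parameters, not fixed by the model):

```python
def flashattention_cost(b, h, s, t, d=128, p=64, w=2):
    """FlashAttention cost model from the bullets above:
    FLOPs = 2*b*h*s*t*(2d+p), Bytes = w*b*h*(s+t)*(2d+p).
    Returns (flops, bytes, arithmetic intensity)."""
    per_elem = 2 * d + p                    # 2*128 + 64 = 320
    flops = 2 * b * h * s * t * per_elem
    nbytes = w * b * h * (s + t) * per_elem
    return flops, nbytes, flops / nbytes

# Arithmetic intensity simplifies to 2st / (w(s+t)), independent of b, h:
_, _, ai_decode = flashattention_cost(b=1, h=128, s=1, t=4096)     # ≈ 1
_, _, ai_prefill = flashattention_cost(b=1, h=128, s=1024, t=4096) # ≈ 819
```

Because the (2d+p) factor appears in both FLOPs and bytes, it cancels in the intensity: AI = 2st / (w(s+t)), so at s=1 the kernel is deeply memory-bound regardless of head count.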