TurboQuant PyTorch
Skill by ara.so — Daily 2026 Skills collection.
From-scratch PyTorch implementation of Google's TurboQuant (ICLR 2026) for compressing LLM KV caches. Achieves 5x compression at 3-bit with 99.5% attention fidelity via two-stage vector quantization.
What It Does
TurboQuant compresses LLM key-value caches to 2–4 bits per coordinate:
- Stage 1: Random orthogonal rotation + Lloyd-Max scalar quantization (MSE-optimal)
- Stage 2: QJL residual correction — 1-bit sign projection that makes inner product estimates unbiased
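Stage 1 can be sketched in a few lines. This is a minimal NumPy illustration, not the skill's PyTorch code: the function names are invented for the example, and the 3-bit level values are approximate stand-ins for the true Lloyd-Max levels of a standard Gaussian.

```python
import numpy as np

# Approximate 3-bit Lloyd-Max levels for N(0, 1) input (illustrative;
# the exact optimal levels come from running the Lloyd-Max iteration).
LLOYD_MAX_3BIT = np.array([-2.152, -1.344, -0.756, -0.245,
                            0.245,  0.756,  1.344,  2.152])

def random_rotation(d, seed=0):
    """Random orthogonal matrix via QR of a Gaussian matrix."""
    rng = np.random.default_rng(seed)
    q, r = np.linalg.qr(rng.standard_normal((d, d)))
    # Fix column signs so the result is uniformly distributed over rotations.
    return q * np.sign(np.diag(r))

def stage1_quantize(x, R):
    """Rotate, rescale to ~unit variance, snap each coord to nearest level."""
    y = R @ x
    scale = np.linalg.norm(y) / np.sqrt(len(y))   # one scalar per vector
    idx = np.abs(y[:, None] / scale - LLOYD_MAX_3BIT[None, :]).argmin(axis=1)
    return idx.astype(np.uint8), scale            # 3-bit codes + fp scale

def stage1_dequantize(idx, scale, R):
    """Invert: look up levels, rescale, rotate back."""
    return R.T @ (LLOYD_MAX_3BIT[idx] * scale)
```

The rotation makes each coordinate approximately Gaussian regardless of the input distribution, which is what lets a single fixed Lloyd-Max codebook be near MSE-optimal for every coordinate.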
Result: attention scores stay accurate even though individual reconstructed vectors may differ noticeably from the originals. The algorithm preserves inner products, not per-vector fidelity.
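The unbiasedness of the sign-based correction rests on a standard Gaussian identity: for g ~ N(0, I), E[sign(g·r) (g·q)] = sqrt(2/π) (q·r)/||r||. A minimal NumPy sketch of that estimator (function names are illustrative, not the skill's API):

```python
import numpy as np

def qjl_encode(r, S):
    """1-bit code: signs of m random Gaussian projections of residual r.
    Storage is m bits plus one scalar for ||r||."""
    return np.sign(S @ r)

def qjl_inner_product(q, signs, r_norm, S):
    """Unbiased estimate of <q, r> from the 1-bit code, using
    E[sign(<g, r>) <g, q>] = sqrt(2/pi) <q, r> / ||r|| for Gaussian g."""
    m = S.shape[0]
    return r_norm * np.sqrt(np.pi / 2) / m * (signs @ (S @ q))
```

Applied to the Stage 1 residual, this correction cancels the quantizer's bias in expectation, which is why attention scores (inner products) survive even at 2–3 bits.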
Compression ratios at 8K context on Qwen2.5-3B (289 MB FP16 baseline):
- 4-bit → 76 MB (3.8x)
- 3-bit → 58 MB (5.0x) ← practical sweet spot
- 2-bit → 40 MB (7.3x)
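The table is consistent with a simple bits-per-coordinate model: b quantization bits plus roughly 0.2 bits/coordinate of amortized overhead (the 0.2 figure is inferred from the numbers above, not stated by the paper) out of 16 bits for FP16. A quick back-of-envelope check:

```python
BASELINE_MB = 289      # Qwen2.5-3B KV cache at 8K context, FP16 (16 bits/coord)
OVERHEAD_BITS = 0.2    # assumed amortized overhead (per-vector scales, QJL bits)

for bits in (4, 3, 2):
    mb = BASELINE_MB * (bits + OVERHEAD_BITS) / 16
    print(f"{bits}-bit: ~{mb:.0f} MB ({BASELINE_MB / mb:.1f}x)")
# prints ~76 MB (3.8x), ~58 MB (5.0x), ~40 MB (7.3x)
```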