# Model Pruning: Compressing LLMs
## When to Use This Skill
Use Model Pruning when you need to:
- Reduce model size by 40-60% with <1% accuracy loss
- Accelerate inference using hardware-friendly sparsity (2-4× speedup)
- Deploy on constrained hardware (mobile, edge devices)
- Compress without retraining using one-shot methods
- Enable efficient serving with reduced memory footprint
**Key Techniques:** Wanda (weight magnitude × activation norm), SparseGPT (second-order, one-shot), structured pruning, N:M sparsity

**Papers:** Wanda (ICLR 2024, arXiv:2306.11695), SparseGPT (ICML 2023, arXiv:2301.00774)
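The two core ideas above can be sketched in a few lines of NumPy. This is an illustrative sketch, not the papers' reference implementations: `wanda_prune` scores each weight by magnitude × calibration activation norm (the Wanda metric) and zeros the lowest-scoring fraction per output row; `prune_2_4` enforces hardware-friendly 2:4 (N:M) sparsity by keeping the two largest-magnitude weights in every group of four. Function names and argument shapes are assumptions for this example.

```python
import numpy as np

def wanda_prune(W: np.ndarray, X: np.ndarray, sparsity: float = 0.5) -> np.ndarray:
    """Unstructured Wanda-style pruning.

    Importance of W[i, j] is |W[i, j]| * ||X[:, j]||_2: weight magnitude
    scaled by that input channel's L2 activation norm, measured on a small
    calibration batch X of shape (n_samples, in_features).
    """
    act_norm = np.linalg.norm(X, axis=0)       # (in_features,) per-channel norm
    metric = np.abs(W) * act_norm              # broadcast over output rows
    k = int(W.shape[1] * sparsity)             # weights to drop per output row
    drop = np.argsort(metric, axis=1)[:, :k]   # indices of the k lowest scores
    W_pruned = W.copy()
    np.put_along_axis(W_pruned, drop, 0.0, axis=1)
    return W_pruned

def prune_2_4(W: np.ndarray) -> np.ndarray:
    """2:4 structured sparsity: in every contiguous group of 4 weights
    along a row, zero the 2 smallest by magnitude (in_features must be
    divisible by 4)."""
    out_f, in_f = W.shape
    groups = np.abs(W).reshape(out_f, in_f // 4, 4)
    drop = np.argsort(groups, axis=2)[:, :, :2]    # 2 smallest per group
    W_pruned = W.copy().reshape(out_f, in_f // 4, 4)
    np.put_along_axis(W_pruned, drop, 0.0, axis=2)
    return W_pruned.reshape(out_f, in_f)
```

In practice both methods are applied layer by layer in a single forward pass over calibration data (one-shot, no retraining); the 2:4 pattern is what NVIDIA sparse tensor cores accelerate.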