sparseeval-evaluation-sparse-optimization


SparseEval: Efficient LLM Evaluation via Sparse Optimization

This skill enables Claude to help users cut LLM evaluation costs by selecting a small, representative subset of benchmark items (called anchors) whose results predict full-benchmark performance. The core technique, from the ICLR 2026 paper SparseEval, formulates benchmark reduction as a sparse optimization problem in five steps: (1) construct a model-item performance matrix, (2) discover its inherent sparsity via spectral clustering, (3) select initial anchors through k-means initialization, (4) optimize anchor weights with an MLP trained via gradient descent, and (5) iteratively refine the anchor set using Anchor Importance Scores (AIS) and Candidate Importance Scores (CIS). The result: evaluate on ~100 items instead of thousands while maintaining a Kendall's tau rank correlation above 0.9 with full evaluation.
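
To make the rank-correlation target concrete, here is a minimal sketch of how one might check whether a candidate anchor set preserves the full-benchmark ranking. The toy matrix, the hand-picked `anchors` list, and the helper names `kendall_tau` and `accuracy` are all hypothetical, and the subset score is a plain unweighted average rather than SparseEval's learned weighting.

```python
from itertools import combinations

def kendall_tau(x, y):
    """Kendall's tau-a: (concordant - discordant) / total pairs."""
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        sign = (x[i] - x[j]) * (y[i] - y[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    total_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / total_pairs

def accuracy(row, items):
    """Fraction of the given item indices that a model answered correctly (+1)."""
    return sum(1 for j in items if row[j] == 1) / len(items)

# Hypothetical toy matrix: 4 models (rows) x 6 items (columns),
# +1 for correct and -1 for incorrect.
scores = [
    [1, 1, 1, 1, 1, -1],
    [1, 1, 1, -1, -1, -1],
    [1, -1, 1, -1, 1, 1],
    [-1, -1, 1, -1, -1, -1],
]
anchors = [0, 3, 4]  # hypothetical anchor set, chosen by hand for illustration

full = [accuracy(row, range(6)) for row in scores]   # full-benchmark accuracy
subset = [accuracy(row, anchors) for row in scores]  # anchor-only accuracy
tau = kendall_tau(full, subset)  # 1.0 here: both orderings agree on every pair
```

In this toy case three items reproduce the ranking from six exactly. On real leaderboard data the same check would be run on held-out models that were not used to select the anchors.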

When to Use

  • When the user wants to reduce the number of benchmark items they need to run inference on to rank or compare LLMs
  • When the user is building a custom evaluation suite and needs to select the most informative test items from a large pool
  • When the user has a model-item performance matrix (rows = models, columns = test items, entries = correct/incorrect) and wants to find which items matter most
  • When the user asks to rank models cheaply without full benchmark runs, using historical evaluation data from leaderboards
  • When designing a lightweight evaluation pipeline for CI/CD or rapid model iteration where full benchmarks are too expensive
  • When the user wants to identify redundant test items in an existing benchmark through sparsity analysis

Key Technique

The Sparsity Insight. SparseEval starts from the observation that the binary model-item performance matrix (1 for correct, -1 for incorrect) across thousands of models is inherently sparse and clustered. When you compute cosine similarity between item vectors and apply spectral clustering, pronounced diagonal blocks emerge -- groups of items that models tend to get right or wrong together. This redundancy means a small subset of "anchor" items can represent the entire benchmark.
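
The block structure described above can be illustrated with a small NumPy sketch. It computes cosine similarity between item columns and then splits the items in two using the sign of the Fiedler vector of the graph Laplacian, which is a simplified two-way stand-in for full spectral clustering and not the paper's exact procedure. The toy matrix and the function names `cosine_similarity_items` and `spectral_bipartition` are assumptions made for illustration.

```python
import numpy as np

def cosine_similarity_items(S):
    """Cosine similarity between item (column) vectors of a models x items matrix."""
    norms = np.linalg.norm(S, axis=0, keepdims=True)
    U = S / norms
    return U.T @ U

def spectral_bipartition(sim):
    """Two-way split from the sign of the Fiedler vector (simplified spectral clustering)."""
    A = np.clip(sim, 0.0, None)         # keep only non-negative affinities
    np.fill_diagonal(A, 0.0)            # no self-loops
    L = np.diag(A.sum(axis=1)) - A      # unnormalized graph Laplacian
    eigvals, eigvecs = np.linalg.eigh(L)
    fiedler = eigvecs[:, 1]             # eigenvector of the second-smallest eigenvalue
    return (fiedler > 0).astype(int)

# Hypothetical toy data: 6 items, each described by its +1/-1 outcomes on 8 models.
# Items 0-2 are near-copies of one pattern, items 3-5 of another.
items = np.array([
    [ 1,  1,  1,  1, -1, -1, -1, -1],
    [-1,  1,  1,  1, -1, -1, -1, -1],
    [ 1,  1,  1,  1, -1, -1, -1,  1],
    [ 1, -1, -1,  1, -1,  1, -1, -1],
    [-1, -1, -1,  1, -1,  1, -1, -1],
    [ 1, -1, -1,  1, -1,  1, -1,  1],
], dtype=float)
S = items.T                             # rows = models, columns = items

sim = cosine_similarity_items(S)
labels = spectral_bipartition(sim)      # items 0-2 share one label, items 3-5 the other
```

Which group receives label 0 or 1 depends on the arbitrary sign of the eigenvector, so only the grouping is meaningful. For more than two clusters, one would typically run k-means on several leading eigenvectors instead of thresholding one.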

MLP-Based Weight Optimization. Given a set of anchor items, SparseEval trains a small MLP to learn weights that minimize the reconstruction loss between the weighted anchor scores and the true full-benchmark scores. The loss is L = (1/M) * ||f(S_train * (1_M * W^T)) - S_train * W_a||_2, where f is the MLP, S_train is the performance matrix for the M training models, 1_M is the all-ones vector of length M (so 1_M * W^T repeats the weights across models and its product with S_train is read elementwise), W encodes the sparse anchor weights, and W_a is the uniform averaging vector, so S_train * W_a is each model's full-benchmark average. The MLP's nonlinearity captures complex item interactions that linear weighting misses -- deeper architectures significantly outperform linear baselines.
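
As a rough illustration of this idea, the sketch below trains a one-hidden-layer MLP with hand-written gradient descent in NumPy to predict each model's full-benchmark average from a fixed set of anchor columns. It is an assumption-laden toy: the synthetic data, the `train_anchor_mlp` function, the hyperparameters, and the use of mean squared error in place of the norm in the loss above are all choices made here for illustration.

```python
import numpy as np

def train_anchor_mlp(S, anchors, hidden=8, lr=0.02, steps=2000, seed=0):
    """Fit per-anchor weights w plus a tanh MLP to predict each model's mean score."""
    rng = np.random.default_rng(seed)
    X = S[:, anchors]                    # anchor scores, shape (M, k)
    y = S.mean(axis=1)                   # full-benchmark average per model
    M, k = X.shape
    w = np.ones(k)                       # learnable anchor weights
    W1 = 0.1 * rng.standard_normal((k, hidden))
    b1 = np.zeros(hidden)
    W2 = 0.1 * rng.standard_normal(hidden)
    b2 = 0.0
    losses = []
    for _ in range(steps):
        Z = X * w                        # weighted anchor scores
        H = np.tanh(Z @ W1 + b1)         # hidden activations
        P = H @ W2 + b2                  # predicted full score
        err = P - y
        losses.append(float(np.mean(err ** 2)))
        # Backpropagate the mean squared error by hand.
        dP = 2.0 * err / M
        dA = np.outer(dP, W2) * (1.0 - H ** 2)
        gW2, gb2 = H.T @ dP, dP.sum()
        gW1, gb1 = Z.T @ dA, dA.sum(axis=0)
        gw = ((dA @ W1.T) * X).sum(axis=0)
        w -= lr * gw
        W1 -= lr * gW1
        b1 -= lr * gb1
        W2 -= lr * gW2
        b2 -= lr * gb2
    return w, losses

# Hypothetical synthetic data: 200 models x 50 items drawn from a simple
# ability-minus-difficulty model, encoded as +1 (correct) / -1 (incorrect).
rng = np.random.default_rng(1)
ability = rng.normal(size=(200, 1))
difficulty = rng.normal(size=(1, 50))
prob = 1.0 / (1.0 + np.exp(-(ability - difficulty)))
S = np.where(rng.random((200, 50)) < prob, 1.0, -1.0)

anchors = [0, 7, 14, 21, 28, 35, 42, 49]  # hand-picked for illustration only
w, losses = train_anchor_mlp(S, anchors)
```

Comparing `losses[0]` with `losses[-1]` shows how much of the full-benchmark score the anchors recover on the training models. A realistic setup would also hold out models for validation and would likely use an autodiff framework instead of manual gradients.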

First Seen
Apr 21, 2026