SparseEval: Efficient LLM Evaluation via Sparse Optimization

This skill enables Claude to help users drastically reduce LLM evaluation costs by selecting a small, representative subset of benchmark items (called anchors) whose results can predict full-benchmark performance. The core technique from the ICLR 2026 paper SparseEval formulates benchmark reduction as a sparse optimization problem: construct a model-item performance matrix, discover its inherent sparsity via spectral clustering, select anchors through k-means initialization, optimize anchor weights with an MLP trained via gradient descent, and iteratively refine the anchor set using Anchor Importance Scores (AIS) and Candidate Importance Scores (CIS). The result: evaluate on ~100 items instead of thousands while maintaining Kendall's tau > 0.9 rank correlation with full evaluation.
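As a concrete reading of the rank-correlation target above, here is a minimal sketch of Kendall's tau between full-benchmark scores and anchor-based estimates. It uses the tau-a variant (no tie correction), which is an assumption, since the variant used by the paper is not stated here. The function name kendall_tau and the five model scores are hypothetical, chosen only for illustration.

```python
from itertools import combinations


def kendall_tau(x, y):
    """Kendall's tau-a: (concordant pairs - discordant pairs) / total pairs."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences with at least 2 entries")
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        s = (x[i] - x[j]) * (y[i] - y[j])
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical accuracies for five models (illustrative values, not real results).
full_scores = [0.62, 0.71, 0.55, 0.80, 0.67]       # full benchmark
anchor_estimates = [0.60, 0.74, 0.58, 0.79, 0.75]  # predicted from anchors

# Only one pair of models (index 1 vs. index 4) is ranked differently.
tau = kendall_tau(full_scores, anchor_estimates)
print(f"Kendall's tau = {tau:.2f}")
```

A value near 1 means the anchor-based estimates rank the models almost exactly as the full benchmark does.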

When to Use

When the user wants to reduce the number of benchmark items they need to run inference on to rank or compare LLMs
When the user is building a custom evaluation suite and needs to select the most informative test items from a large pool
When the user has a model-item performance matrix (rows = models, columns = test items, entries = correct/incorrect) and wants to find which items matter most
When the user asks to rank models cheaply without full benchmark runs, using historical evaluation data from leaderboards
When designing a lightweight evaluation pipeline for CI/CD or rapid model iteration where full benchmarks are too expensive
When the user wants to identify redundant test items in an existing benchmark through sparsity analysis

Key Technique

The Sparsity Insight. SparseEval starts from the observation that the binary model-item performance matrix (1 for correct, -1 for incorrect) across thousands of models is inherently sparse and clustered. When you compute cosine similarity between item vectors and apply spectral clustering, pronounced diagonal blocks emerge -- groups of items that models tend to get right or wrong together. This redundancy means a small subset of "anchor" items can represent the entire benchmark.
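The block structure described above can be illustrated with a small sketch. It runs on synthetic data and uses a simplified two-way spectral partition (the sign of the Fiedler vector of a normalized Laplacian) rather than the full spectral clustering used in the paper. The two-skill data generator, the matrix sizes, and the mapping from cosine similarity to a non-negative affinity are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for a model-item matrix: 200 models x 12 items, entries +1 / -1.
# Assumption: items 0-5 depend on one latent skill and items 6-11 on another,
# so that the redundancy between items is visible.
n_models, n_items = 200, 12
skill = rng.normal(size=(n_models, 2))
group = np.array([0] * 6 + [1] * 6)
noise = 0.3 * rng.normal(size=(n_models, n_items))
S = np.where(skill[:, group] + noise > 0, 1.0, -1.0)

# Cosine similarity between item (column) vectors.
cols = S / np.linalg.norm(S, axis=0, keepdims=True)
sim = cols.T @ cols

# Two-way spectral partition: shift similarity into [0, 1] to get a non-negative
# affinity, build the symmetric normalized Laplacian, and split the items by the
# sign of the eigenvector with the second-smallest eigenvalue (Fiedler vector).
A = (sim + 1.0) / 2.0
np.fill_diagonal(A, 0.0)
d = A.sum(axis=1)
L = np.eye(n_items) - A / np.sqrt(np.outer(d, d))
eigvals, eigvecs = np.linalg.eigh(L)  # eigenvalues in ascending order
labels = (eigvecs[:, 1] > 0).astype(int)

# Reordering items by cluster label exposes the diagonal blocks.
order = np.argsort(labels, kind="stable")
print(np.round(sim[np.ix_(order, order)], 1))
```

On real leaderboard data there would be many more items and clusters, and a library routine for multi-way spectral clustering would be a more natural choice than this two-way split.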

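MLP-Based Weight Optimization. Given a set of anchor items, SparseEval trains a small MLP to learn weights that minimize the reconstruction loss between the weighted anchor scores and the true full-benchmark scores. The loss is L = (1/M) * ||f(S_train * (1_M * W^T)) - S_train * W_a||_2 where f is the MLP, S_train is the performance matrix for the M training models, 1_M is the all-ones vector of length M, W encodes the sparse anchor weights (zero for non-anchor items), and W_a is the uniform averaging vector, so that S_train * W_a gives each model's true full-benchmark score. Because 1_M * W^T has the same shape as S_train, the first product is read elementwise: every model's item results are scaled by the same weights before entering the MLP. The MLP's nonlinearity captures complex item interactions that linear weighting misses -- deeper architectures significantly outperform linear baselines.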
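The optimization above can be sketched in NumPy with hand-written gradients. This is a simplified illustration under several assumptions, not the paper's implementation: the data is synthetic, the anchors are picked at random instead of by k-means, the MLP has a single tanh hidden layer, and mean squared error stands in for the norm in the loss. All variable names and hyperparameters are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic performance matrix (assumption): M models x N items, +1 correct, -1 incorrect.
M, N, K, H = 300, 200, 10, 8  # models, items, anchors, hidden units
ability = rng.normal(size=(M, 1))
difficulty = rng.normal(size=(1, N))
p_correct = 1.0 / (1.0 + np.exp(-(ability - difficulty)))
S = np.where(rng.random((M, N)) < p_correct, 1.0, -1.0)

anchors = rng.choice(N, size=K, replace=False)  # placeholder for k-means-selected anchors
X = S[:, anchors]    # anchor results per model, shape (M, K)
y = S.mean(axis=1)   # uniform-average full-benchmark score (S_train * W_a)

# Learnable parameters: anchor weights w and a one-hidden-layer tanh MLP f.
w = np.ones(K)
W1 = 0.3 * rng.normal(size=(K, H))
b1 = np.zeros(H)
w2 = 0.3 * rng.normal(size=H)
b2 = 0.0

lr, losses = 0.01, []
for step in range(2000):
    # Forward pass: weight the anchor scores, then apply the MLP.
    xw = X * w
    h = np.tanh(xw @ W1 + b1)
    pred = h @ w2 + b2
    err = pred - y
    losses.append(float(np.mean(err ** 2)))

    # Backward pass: manual gradients of the mean squared error.
    g = 2.0 * err / M
    dz = np.outer(g, w2) * (1.0 - h ** 2)
    grad_w2, grad_b2 = h.T @ g, g.sum()
    grad_W1, grad_b1 = xw.T @ dz, dz.sum(axis=0)
    grad_w = ((dz @ W1.T) * X).sum(axis=0)

    # Plain full-batch gradient-descent update.
    w -= lr * grad_w
    W1 -= lr * grad_W1
    b1 -= lr * grad_b1
    w2 -= lr * grad_w2
    b2 -= lr * grad_b2

print(f"loss: {losses[0]:.4f} -> {losses[-1]:.4f}")
```

In practice an autodiff framework would replace the manual backward pass, and the fit would be checked on held-out models rather than on the training loss alone.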
