sparseeval-evaluation-sparse-optimization
SparseEval: Efficient LLM Evaluation via Sparse Optimization
This skill enables Claude to help users drastically reduce LLM evaluation costs by selecting a small, representative subset of benchmark items (called anchors) whose results can predict full-benchmark performance. The core technique from the ICLR 2026 paper SparseEval formulates benchmark reduction as a sparse optimization problem: construct a model-item performance matrix, discover its inherent sparsity via spectral clustering, select anchors through k-means initialization, optimize anchor weights with an MLP trained via gradient descent, and iteratively refine the anchor set using Anchor Importance Scores (AIS) and Candidate Importance Scores (CIS). The result: evaluate on ~100 items instead of thousands while maintaining Kendall's tau > 0.9 rank correlation with full evaluation.
When to Use
- When the user wants to reduce the number of benchmark items they need to run inference on to rank or compare LLMs
- When the user is building a custom evaluation suite and needs to select the most informative test items from a large pool
- When the user has a model-item performance matrix (rows = models, columns = test items, entries = correct/incorrect) and wants to find which items matter most
- When the user asks to rank models cheaply without full benchmark runs, using historical evaluation data from leaderboards
- When designing a lightweight evaluation pipeline for CI/CD or rapid model iteration where full benchmarks are too expensive
- When the user wants to identify redundant test items in an existing benchmark through sparsity analysis
Key Technique
The Sparsity Insight. SparseEval starts from the observation that the binary model-item performance matrix (1 for correct, -1 for incorrect) across thousands of models is inherently sparse and clustered. When you compute cosine similarity between item vectors and apply spectral clustering, pronounced diagonal blocks emerge -- groups of items that models tend to get right or wrong together. This redundancy means a small subset of "anchor" items can represent the entire benchmark.
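The block structure described above can be illustrated on synthetic data. The following is a minimal sketch, not the paper's implementation: the helper names `item_cosine_similarity` and `spectral_bipartition` are hypothetical, the data is randomly generated, and the clustering step is the simplest two-way spectral split (sign of the Fiedler vector) rather than a full multi-cluster spectral clustering.

```python
import numpy as np

def item_cosine_similarity(S):
    """Cosine similarity between item (column) vectors of a models x items matrix."""
    norms = np.linalg.norm(S, axis=0, keepdims=True)
    X = S / np.clip(norms, 1e-12, None)
    return X.T @ X

def spectral_bipartition(sim):
    """Two-way spectral split: sign of the Fiedler vector of the graph Laplacian."""
    A = np.clip(sim, 0.0, None)          # use non-negative similarities as affinities
    np.fill_diagonal(A, 0.0)             # no self-loops
    L = np.diag(A.sum(axis=1)) - A       # unnormalized graph Laplacian
    _, vecs = np.linalg.eigh(L)          # eigenvalues come back in ascending order
    return (vecs[:, 1] > 0).astype(int)  # second-smallest eigenvector

# Synthetic data (assumption): each model has two latent skills in {-1, +1};
# the first 5 items test skill 0, the last 5 test skill 1, with 10% label noise.
rng = np.random.default_rng(0)
n_models, n_per_group = 200, 5
skill = rng.choice([-1, 1], size=(n_models, 2))
S = np.repeat(skill, n_per_group, axis=1)
S = np.where(rng.random(S.shape) < 0.1, -S, S).astype(float)

sim = item_cosine_similarity(S)
labels = spectral_bipartition(sim)
print(np.round(sim, 2))   # two bright diagonal blocks
print(labels)             # items 0-4 share one label, items 5-9 the other
```

Items within a group end up with high mutual similarity and the same label, so a single item per group would already carry most of the group's signal. That is the redundancy anchors exploit.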
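MLP-Based Weight Optimization. Given a set of anchor items, SparseEval trains a small MLP to learn weights that minimize the reconstruction loss between the weighted anchor scores and the true full-benchmark scores. The loss is L = (1/M) * ||f(S_train * (1_M * W^T)) - S_train * W_a||_2, where f is the MLP, M is the number of training models, S_train is the performance matrix for those models (one row per model, one column per item), W encodes the sparse anchor weights (nonzero only at anchor items), 1_M is the all-ones vector of length M, and W_a is the uniform averaging vector. The first product is read here as element-wise, since S_train and 1_M * W^T have the same shape; it scales each item column by its weight. The term S_train * W_a is then each model's mean score over the full benchmark. The MLP's nonlinearity captures complex item interactions that linear weighting misses -- deeper architectures significantly outperform linear baselines.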
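The weight-optimization step can be sketched with a one-hidden-layer MLP trained by full-batch gradient descent in NumPy. This is an illustrative sketch under several assumptions, not the paper's code: the function `train_anchor_mlp` is hypothetical, the performance matrix is synthetic, the anchors are chosen at random as a placeholder for the k-means initialization, the anchor weights are absorbed into the first layer, and mean squared error stands in for the norm in the loss above.

```python
import numpy as np

def train_anchor_mlp(X, y, hidden=8, lr=0.05, steps=500, seed=0):
    """Fit pred = tanh(X @ W1 + b1) @ w2 + b2 to y with full-batch gradient descent."""
    rng = np.random.default_rng(seed)
    M, k = X.shape
    W1 = 0.1 * rng.normal(size=(k, hidden))
    b1 = np.zeros(hidden)
    w2 = 0.1 * rng.normal(size=hidden)
    b2 = 0.0
    losses = []
    for _ in range(steps):
        h = np.tanh(X @ W1 + b1)                     # hidden activations, shape (M, hidden)
        err = h @ w2 + b2 - y                        # prediction error per model
        losses.append(float(np.mean(err ** 2)))      # MSE surrogate for the loss
        g_pred = 2.0 * err / M                       # dLoss/dpred
        g_pre = np.outer(g_pred, w2) * (1.0 - h ** 2)  # backprop through tanh
        W1 -= lr * (X.T @ g_pre)
        b1 -= lr * g_pre.sum(axis=0)
        w2 -= lr * (h.T @ g_pred)
        b2 -= lr * g_pred.sum()
    return (W1, b1, w2, b2), losses

# Synthetic stand-in for S_train (assumption): correctness depends on
# model ability minus item difficulty, encoded as +1 / -1.
rng = np.random.default_rng(0)
M, N, k = 300, 200, 10
ability = rng.normal(size=(M, 1))
difficulty = rng.normal(size=(1, N))
p_correct = 1.0 / (1.0 + np.exp(-(ability - difficulty)))
S_train = np.where(rng.random((M, N)) < p_correct, 1.0, -1.0)

anchors = rng.choice(N, size=k, replace=False)  # placeholder for k-means anchor selection
X = S_train[:, anchors]                         # scores on anchor items only
y = S_train.mean(axis=1)                        # full-benchmark mean score per model

params, losses = train_anchor_mlp(X, y)
print(f"initial loss {losses[0]:.4f}, final loss {losses[-1]:.4f}")
```

In practice the fit would be checked on held-out models, for example by comparing the predicted and true rankings with Kendall's tau, and the anchor set would then be refined using the AIS and CIS scores.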