predicting-improving-test-time-scaling

Predicting and Improving Test-Time Scaling via Reward Tail-Guided Search

This skill enables Claude to implement and apply Scaling-Law Guided (SLG) Search, a test-time compute optimization algorithm that replaces naive best-of-N sampling with principled budget allocation. Instead of uniformly sampling N candidates and picking the best, SLG Search fits a Generalized Pareto Distribution (GPD) to the tail of observed rewards, predicts how much improvement additional compute would yield per candidate path, and concentrates remaining budget on the most promising intermediate states. This achieves the same reward quality as best-of-N but with polynomially less compute.
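
The budget-concentration step can be sketched as a greedy loop. This is a minimal illustration, not the SLG Search allocation rule itself: `allocate_budget`, the `predict(state, n)` callable, and the `pilot` parameter are hypothetical names, and handing out one sample at a time to the state with the largest predicted marginal gain is an assumption chosen for simplicity.

```python
from typing import Callable, Dict, Hashable, Sequence


def allocate_budget(
    states: Sequence[Hashable],
    predict: Callable[[Hashable, int], float],
    budget: int,
    pilot: int = 1,
) -> Dict[Hashable, int]:
    """Greedily split `budget` extra samples across candidate states.

    `predict(state, n)` is a hypothetical callable returning the predicted
    best reward at `state` after n samples. `pilot` is the number of samples
    each state is assumed to have received already.
    """
    alloc = {s: 0 for s in states}

    def marginal_gain(s: Hashable) -> float:
        # Predicted improvement from giving state s one more sample.
        n = pilot + alloc[s]
        return predict(s, n + 1) - predict(s, n)

    for _ in range(budget):
        best = max(alloc, key=marginal_gain)
        alloc[best] += 1
    return alloc
```

With a predictor whose gains shrink as n grows, the loop keeps funding a state only while its next sample is still predicted to help more than a sample spent elsewhere.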

When to Use

  • When the user wants to optimize how an LLM allocates test-time compute across multiple candidate solutions (e.g., math reasoning, code generation, planning)
  • When building a best-of-N sampling pipeline that needs principled guidance on how large N should be or how to split the budget across stages
  • When implementing a search algorithm over LLM outputs that needs to decide which partial solutions to expand further
  • When the user asks to predict LLM scaling behavior without running exhaustive evaluations
  • When designing multi-stage reasoning pipelines where intermediate states compete for compute budget
  • When implementing reward-model-guided tree search that needs a principled node selection criterion beyond simple reward ranking

Key Technique

The Problem with Best-of-N: Vanilla best-of-N (BoN) generates N independent samples, scores them with a reward model, and returns the highest-scoring one. This is wasteful because it spends the same compute on every sample path. Some intermediate states lead to reward distributions with heavy tails (high upside potential), while others have light tails (diminishing returns), and BoN cannot distinguish between them.
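
For reference, the baseline being replaced can be sketched in a few lines. The `generate` and `reward` callables are hypothetical stand-ins for an LLM sampler and a reward model; nothing here is specific to any particular library.

```python
from typing import Callable, Tuple, TypeVar

T = TypeVar("T")


def best_of_n(
    generate: Callable[[], T],
    reward: Callable[[T], float],
    n: int,
) -> Tuple[T, float]:
    """Draw n independent candidates and return the highest-scoring one.

    Every draw costs the same and is made from the same starting point,
    so no compute is steered toward more promising intermediate states.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    scored = [(reward(c), c) for c in (generate() for _ in range(n))]
    best_score, best = max(scored, key=lambda pair: pair[0])
    return best, best_score
```

The only knob is N, which is why the baseline has no way to act on differences in tail behavior between paths.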

Tail-Guided Prediction: SLG Search estimates the tail behavior of rewards at each intermediate state by fitting a Generalized Pareto Distribution to the reward samples that exceed a high quantile threshold. The GPD shape parameter (xi) captures whether the reward distribution is heavy-tailed (xi > 0, power-law decay, meaning more compute will likely find much better solutions) or light-tailed (xi near 0, exponential decay, meaning returns diminish quickly). From a small pilot batch of m samples at each state, the algorithm predicts the expected maximum reward achievable with any budget N via the scaling law V(s, N) ~ baseline + scale * N^(xi), so no exhaustive evaluation is needed.
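
A rough sketch of the fit-then-predict step, using only the Python standard library. Several choices here are assumptions rather than the method described above: the GPD is fitted by the method of moments (which requires xi < 1/2), the threshold is an empirical quantile, and the predicted maximum is approximated by the (1 - 1/N) quantile of the fitted tail. The names `fit_gpd_tail` and `predict_max` are hypothetical. For xi > 0 the prediction has the form baseline + scale * N^(xi), with baseline = u - sigma/xi and scale = sigma * p^xi / xi, where p is the fraction of samples above the threshold u; as xi approaches 0 it reduces to logarithmic growth.

```python
import math
from statistics import mean, variance
from typing import Sequence, Tuple


def fit_gpd_tail(
    rewards: Sequence[float], tail_quantile: float = 0.8
) -> Tuple[float, float, float, float]:
    """Method-of-moments GPD fit to rewards above an empirical threshold.

    Returns (u, sigma, xi, tail_frac): threshold, scale, shape, and the
    fraction of samples that lie above the threshold.
    """
    xs = sorted(rewards)
    cut = int(tail_quantile * len(xs))
    if cut < 1 or len(xs) - cut < 2:
        raise ValueError("need one sample below and two above the threshold")
    u = xs[cut - 1]
    excesses = [x - u for x in xs[cut:]]
    m, v = mean(excesses), variance(excesses)
    if v == 0:
        raise ValueError("tail samples are identical; shape is undefined")
    # GPD moments: mean = sigma / (1 - xi), var = mean^2 / (1 - 2 * xi).
    ratio = m * m / v
    xi = 0.5 * (1.0 - ratio)
    sigma = 0.5 * m * (1.0 + ratio)
    return u, sigma, xi, len(excesses) / len(xs)


def predict_max(u: float, sigma: float, xi: float, tail_frac: float, n: int) -> float:
    """Approximate the best reward among n samples by the (1 - 1/n) quantile."""
    t = n * tail_frac  # expected number of threshold exceedances in n draws
    if t < 1:
        return u  # crude floor: too few draws to reach the tail
    if abs(xi) < 1e-9:
        return u + sigma * math.log(t)  # exponential-tail limit
    return u + (sigma / xi) * (t ** xi - 1.0)
```

A typical use would be to call `fit_gpd_tail` on the m pilot rewards at a state and then evaluate `predict_max` at several candidate budgets N to compare states before spending more compute.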

First Seen
Apr 21, 2026