Predicting and Improving Test-Time Scaling via Reward Tail-Guided Search

This skill enables Claude to implement and apply Scaling-Law Guided (SLG) Search, a test-time compute optimization algorithm that replaces naive best-of-N sampling with principled budget allocation. Instead of uniformly sampling N candidates and picking the best, SLG Search fits a Generalized Pareto Distribution (GPD) to the tail of observed rewards, predicts how much improvement additional compute would yield for each candidate path, and concentrates the remaining budget on the most promising intermediate states. The result is the same reward quality as best-of-N with polynomially less compute.

When to Use

When the user wants to optimize how an LLM allocates test-time compute across multiple candidate solutions (e.g., math reasoning, code generation, planning)
When building a best-of-N sampling pipeline that needs principled guidance on how large N should be or how to split the budget across stages
When implementing a search algorithm over LLM outputs that needs to decide which partial solutions to expand further
When the user asks to predict LLM scaling behavior without running exhaustive evaluations
When designing multi-stage reasoning pipelines where intermediate states compete for compute budget
When implementing reward-model-guided tree search that needs a principled node selection criterion beyond simple reward ranking

Key Technique

The Problem with Best-of-N: Vanilla best-of-N (BoN) generates N independent samples, scores them with a reward model, and returns the best. This is wasteful because it treats every sample path equally. Some intermediate states lead to reward distributions with heavy tails (high upside potential), while others have light tails (diminishing returns). BoN cannot distinguish between the two, so it spends the same budget on both.
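
As a reference point, the vanilla baseline described above can be sketched in a few lines. The names `generate` and `reward` are hypothetical callables standing in for the LLM sampler and the reward model; neither comes from the original description.

```python
from typing import Callable, Tuple


def best_of_n(
    generate: Callable[[], str],
    reward: Callable[[str], float],
    n: int,
) -> Tuple[str, float]:
    """Draw n independent candidates and return the highest-scoring one.

    `generate` and `reward` are placeholders (assumptions) for an LLM
    sampler and a reward model.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    # Every candidate gets exactly one sample's worth of compute,
    # regardless of how promising its reward distribution looks.
    scored = [(candidate, reward(candidate)) for candidate in (generate() for _ in range(n))]
    return max(scored, key=lambda pair: pair[1])
```

Note that the budget n is fixed up front and spread uniformly, which is the behavior SLG Search is meant to improve on.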

Tail-Guided Prediction: SLG Search estimates the tail behavior of rewards at each intermediate state by fitting a Generalized Pareto Distribution to the observed rewards that exceed a high quantile threshold. The GPD shape parameter (xi) captures whether the reward distribution is heavy-tailed (xi > 0, power-law decay, so more compute is likely to find much better solutions) or light-tailed (xi near 0, exponential decay, so returns diminish quickly). From a small pilot set of m samples at each state, the algorithm predicts the expected maximum reward achievable with any budget N via the scaling law V(s, N) ~ baseline + scale * N^(xi). As xi approaches 0, this power law flattens toward growth that is only logarithmic in N. The prediction avoids exhaustive evaluation.
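
The fit-then-predict step can be illustrated with a minimal, standard-library-only sketch. Several things here are assumptions rather than the method from the original description: the GPD is fitted by the method of moments (chosen for simplicity; it is only meaningful for xi < 0.5), the predicted value is the GPD return level (the reward exceeded about once in N samples) used as a proxy for the expected maximum, and all function names (`synthetic_rewards`, `fit_gpd_tail`, `predict_max_reward`, `pick_state_to_expand`) as well as the "expand the state with the highest prediction" rule are hypothetical.

```python
import math
from statistics import mean, variance
from typing import Dict, List, Tuple


def synthetic_rewards(xi: float, n: int) -> List[float]:
    """Deterministic toy rewards: GPD(xi, scale=1) quantiles at evenly spaced levels."""
    probs = [(i + 0.5) / n for i in range(n)]
    if xi == 0.0:
        return [-math.log(1.0 - p) for p in probs]
    return [((1.0 - p) ** (-xi) - 1.0) / xi for p in probs]


def fit_gpd_tail(rewards: List[float], tail_frac: float = 0.25) -> Tuple[float, float, float, float]:
    """Method-of-moments GPD fit to the excesses over an empirical threshold.

    Returns (u, xi, sigma, p_u): threshold, shape, scale, and tail fraction.
    """
    ordered = sorted(rewards)
    k = max(2, int(len(ordered) * tail_frac))  # number of tail samples
    if len(ordered) <= k:
        raise ValueError("need more pilot samples than tail samples")
    u = ordered[-k - 1]  # threshold: largest sample below the top k
    excesses = [r - u for r in ordered[-k:]]
    m, v = mean(excesses), variance(excesses)
    if m <= 0.0 or v <= 0.0:
        raise ValueError("tail is degenerate (ties); cannot fit a GPD")
    ratio = m * m / v  # equals 1 - 2*xi for an exact GPD
    xi = 0.5 * (1.0 - ratio)
    sigma = 0.5 * m * (1.0 + ratio)
    return u, xi, sigma, k / len(ordered)


def predict_max_reward(u: float, xi: float, sigma: float, p_u: float, n: int) -> float:
    """Return level: the reward exceeded roughly once in n samples."""
    # Clamp so the prediction never drops below the threshold itself.
    exceedances = max(n * p_u, 1.0)
    if abs(xi) < 1e-6:
        return u + sigma * math.log(exceedances)  # xi -> 0 limit: log growth
    return u + (sigma / xi) * (exceedances ** xi - 1.0)


def pick_state_to_expand(pilots: Dict[str, List[float]], budget: int) -> str:
    """Hypothetical selection rule: expand the state with the highest prediction."""
    def predicted(name: str) -> float:
        return predict_max_reward(*fit_gpd_tail(pilots[name]), budget)
    return max(pilots, key=predicted)


if __name__ == "__main__":
    pilots = {
        "heavy": synthetic_rewards(0.25, 400),  # power-law tail
        "light": synthetic_rewards(0.0, 400),   # exponential tail
    }
    for name, rewards in pilots.items():
        u, xi, sigma, p_u = fit_gpd_tail(rewards)
        print(name, round(xi, 3), round(predict_max_reward(u, xi, sigma, p_u, 10_000), 2))
    print("expand:", pick_state_to_expand(pilots, 1_000_000))
```

In practice a maximum-likelihood fit is the more common choice for GPD estimation; the moment estimator is used here only to keep the sketch dependency-free.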
