Interval Calibration Check

People state uncertainty as intervals - "two to four weeks, 90 percent sure" - and those intervals are reliably too narrow. Overprecision is the most robust form of overconfidence: stated 90 percent intervals contain the true value far less than 90 percent of the time, and subjective intervals are sometimes only a fraction as wide as the judge's own information would warrant. A stated "90" that historically hits 50 is not a confidence level; it is a habit of speech. Everything downstream that takes the number literally - an expected-value calculation, a risk model, a commitment - inherits the error.

This method interrogates the WIDTH of a stated uncertainty: does your 90 mean 90? It runs two coupled moves, both of which operate on the width and never on the location of the estimate: an equivalent-bet indifference test at elicitation time, and hit-rate scoring against resolved outcomes. It then emits a calibration scorecard. The durable move is not asking "how sure are you?" again. It is converting that question into a concrete bet, widening the interval until the bet is genuinely a toss-up, and scoring the stated confidence against the truths that actually arrive.
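The two moves can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a prescribed implementation: the function names, the `step` and `max_rounds` parameters, and the `prefers_lottery` callback are all hypothetical. The callback stands in for asking a human whether they would rather win a prize if the truth lands inside their interval, or through a lottery that pays out at the stated confidence level. Widening is symmetric about the midpoint, so the location of the estimate is never touched.

```python
from typing import Callable, Sequence, Tuple

Interval = Tuple[float, float]


def widen_until_indifferent(
    interval: Interval,
    prefers_lottery: Callable[[Interval], bool],
    step: float = 0.25,      # hypothetical: fractional width added per round
    max_rounds: int = 20,    # hypothetical: safety cap on elicitation rounds
) -> Interval:
    """Widen an interval about its midpoint until the judge no longer
    prefers the reference lottery over betting on the interval."""
    low, high = interval
    for _ in range(max_rounds):
        if not prefers_lottery((low, high)):
            break  # the bet is now a toss-up (or the interval is preferred)
        pad = (high - low) * step / 2
        low, high = low - pad, high + pad  # midpoint stays fixed
    return (low, high)


def hit_rate(intervals: Sequence[Interval], outcomes: Sequence[float]) -> float:
    """Fraction of resolved outcomes that fell inside their stated interval."""
    if len(intervals) != len(outcomes) or not intervals:
        raise ValueError("need one resolved outcome per interval, and at least one")
    hits = sum(low <= x <= high for (low, high), x in zip(intervals, outcomes))
    return hits / len(intervals)
```

Comparing `hit_rate(...)` over a person's resolved intervals with the confidence they stated for those intervals gives the scored half of the scorecard; the elicitation half depends entirely on honest human answers to the bet question.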

When to Use

A consequential plan, forecast, or commitment rests on a stated interval or confidence number that has never been audited - the "90 percent sure we ship in Q3" plan, the cost range in a proposal, the confidence column in a decision journal or assumption ledger.
The same person or team makes repeated resolvable estimates, so a track record exists or can accumulate and the scored-feedback half has material to work with.
A method that consumes probability numbers at face value sits immediately downstream (an expected-value decision tree, a risk model) - calibrate the inputs before the arithmetic launders them.
The worry is that the stated confidence is too tight to trust (overprecision), not that the central number is in the wrong place.
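To see why an unaudited confidence number matters to a downstream method, here is a small arithmetic sketch. The payoff figures are invented purely for illustration; the only numbers taken from the text are a stated 90 percent and a historical hit rate of 50 percent. The function name `expected_value` and its parameters are hypothetical.

```python
def expected_value(p_hit: float, gain_if_hit: float, loss_if_miss: float) -> float:
    """Expected value of a commitment that pays gain_if_hit when the truth
    lands inside the interval and costs loss_if_miss when it does not."""
    if not 0.0 <= p_hit <= 1.0:
        raise ValueError("p_hit must be a probability")
    return p_hit * gain_if_hit - (1.0 - p_hit) * loss_if_miss


# Illustrative payoffs only (assumed, not from any real case).
at_face_value = expected_value(0.90, gain_if_hit=100.0, loss_if_miss=300.0)
at_track_record = expected_value(0.50, gain_if_hit=100.0, loss_if_miss=300.0)

print(f"taking the stated 90 literally: {at_face_value:+.1f}")
print(f"using the observed 50:          {at_track_record:+.1f}")
```

With these assumed payoffs the sign of the expected value flips, from roughly +60 to -100, once the stated confidence is replaced by the observed hit rate. The arithmetic is identical in both cases; only the input changed.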

When NOT to Use

Do not run it on the agent's own confidence. An LLM posing an equivalent bet to itself has no felt indifference to reveal; the test becomes the same self-report in different words, and verbalized model confidence is itself systematically overconfident (Xiong et al., 2024). This calibrates a human's stated intervals through elicitation. It is not a self-calibration device for the model. This is the central wall.
Do not use it when the problem is the location of the estimate, not the width. A wrong number, well calibrated, is still wrong. Route a wrong central estimate to think-reference-class-forecasting (anchor on the base rate of comparable cases) or think-fermi-estimation (build the number from factors). Wrong number, use those; untrustworthy "sure," use this.
Do not calibrate an interval around a fact that can be looked up. Where the answer can simply be checked, or no genuine uncertainty exists, calibrating its interval is theater.
Do not present a one-shot bet-test as a full calibration. Without resolvable items only the bet-test half applies, and the bet device is the least-evidenced part of the protocol; say plainly that the scorecard is one-legged.
Do not promise full debiasing. The controlled record shows partial correction with a stubborn residue. Promise tighter honesty about uncertainty, not calibrated certainty.
Do not confuse it with content moves. It never asks what information is missing (that is the consider-the-unknowns move) and never generates a second estimate to average (that is think-dialectical-bootstrapping). It is content-blind: it only asks whether the stated number means what it claims.
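One way to keep a one-shot bet-test from passing as a full calibration is to make the missing leg explicit in the scorecard itself. The sketch below is an assumption about what such a record could look like; the source does not define the scorecard's fields, so the `Scorecard` class and every attribute name here are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Scorecard:
    stated_confidence: float   # e.g. 0.90 for a stated "90 percent"
    bet_test_widened: bool     # did the equivalent-bet test widen the interval?
    n_resolved: int = 0        # resolved items available for scoring
    hits: int = 0              # resolved outcomes that fell inside the interval

    @property
    def hit_rate(self) -> Optional[float]:
        """Observed hit rate, or None when nothing has resolved yet."""
        return self.hits / self.n_resolved if self.n_resolved else None

    @property
    def one_legged(self) -> bool:
        """True when only the bet-test half could be run."""
        return self.n_resolved == 0

    def summary(self) -> str:
        if self.one_legged:
            return (f"stated {self.stated_confidence:.0%}; "
                    "no resolved items, bet-test only (one-legged)")
        return (f"stated {self.stated_confidence:.0%}; "
                f"observed {self.hit_rate:.0%} over {self.n_resolved} resolved items")
```

For example, `Scorecard(0.90, True).summary()` reports the one-legged state in words rather than leaving a blank hit rate that a reader might take for a clean result.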

think-interval-calibration-check
