think-interval-calibration-check
Interval Calibration Check
People state uncertainty as intervals - "two to four weeks, 90 percent sure" - and those intervals are reliably too narrow. Overprecision is the most robust form of overconfidence: stated 90 percent intervals contain the true value far less than 90 percent of the time, and subjective intervals are sometimes only a fraction as wide as the judge's own information would warrant. A stated "90" that historically hits 50 is not a confidence level, it is a habit of speech, and everything downstream that takes the number literally - an expected-value calculation, a risk model, a commitment - inherits the error. This method interrogates the WIDTH of a stated uncertainty: does your 90 mean 90? It runs two coupled moves that both operate on the width and never on the location of the estimate - an equivalent-bet indifference test at elicitation time, and hit-rate scoring against resolved outcomes - and emits a calibration scorecard. The durable move is not asking "how sure are you?" again. It is converting that question into a concrete bet, widening until the bet is genuinely a toss-up, and scoring the stated confidence against the truths that actually arrive.
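The hit-rate half of the scorecard can be sketched in a few lines. This is a minimal illustration, not the method's prescribed scorecard format: the `Estimate` record, the `hit_rate` helper, and the sample track record are all hypothetical, chosen so that intervals stated at 90 percent contain the truth only half the time.

```python
from dataclasses import dataclass


@dataclass
class Estimate:
    low: float      # lower bound of the stated interval
    high: float     # upper bound of the stated interval
    stated: float   # stated confidence, e.g. 0.90
    actual: float   # outcome once the estimate resolved


def hit_rate(estimates):
    """Fraction of resolved estimates whose interval contained the outcome."""
    if not estimates:
        raise ValueError("need at least one resolved estimate")
    hits = sum(1 for e in estimates if e.low <= e.actual <= e.high)
    return hits / len(estimates)


# Hypothetical track record: durations in weeks, every interval stated at 90 percent.
record = [
    Estimate(2, 4, 0.90, 3),
    Estimate(2, 4, 0.90, 6),
    Estimate(1, 2, 0.90, 2),
    Estimate(3, 5, 0.90, 9),
]

observed = hit_rate(record)
stated = record[0].stated
print(f"stated {stated:.0%}, observed {observed:.0%}, gap {stated - observed:+.0%}")
```

Note that the score only looks at whether each outcome fell inside its interval. It says nothing about whether the midpoint was well placed, which matches the method's focus on width rather than location.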
When to Use
- A consequential plan, forecast, or commitment rests on a stated interval or confidence number that has never been audited - the "90 percent sure we ship in Q3" plan, the cost range in a proposal, the confidence column in a decision journal or assumption ledger.
- The same person or team makes repeated resolvable estimates, so a track record exists or can accumulate and the scored-feedback half has material to work with.
- A method that consumes probability numbers at face value sits immediately downstream (an expected-value decision tree, a risk model) - calibrate the inputs before the arithmetic launders them.
- The worry is that the stated confidence is too tight to trust (overprecision), not that the central number is in the wrong place.
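To see why the inputs should be calibrated before the arithmetic runs, consider a toy expected-value calculation. The payoffs and the `expected_value` helper are hypothetical, and treating an interval's historical hit rate as the probability of the favorable outcome is a simplifying assumption made only for illustration.

```python
def expected_value(p_success, gain, loss):
    """Expected value of a two-outcome bet: win `gain` or lose `loss`."""
    return p_success * gain - (1 - p_success) * loss


gain, loss = 100.0, 300.0  # hypothetical payoffs, arbitrary units

ev_stated = expected_value(0.90, gain, loss)    # stated confidence taken literally
ev_observed = expected_value(0.50, gain, loss)  # historical hit rate substituted

print(f"EV at stated 90%:   {ev_stated:+.1f}")
print(f"EV at observed 50%: {ev_observed:+.1f}")
```

With these made-up payoffs the decision flips sign: the bet looks attractive at the stated 90 and unattractive at the observed 50. Nothing in the arithmetic signals that the input was a habit of speech rather than a probability.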
When NOT to Use
- Do not run it on the agent's own confidence. An LLM posing an equivalent bet to itself has no felt indifference to reveal; the test becomes the same self-report in different words, and verbalized model confidence is itself systematically overconfident (Xiong et al., 2024). This calibrates a human's stated intervals through elicitation. It is not a self-calibration device for the model. This is the central wall.
- Do not use it when the problem is the location of the estimate, not the width. A wrong number, well calibrated, is still wrong. Route a wrong central estimate to think-reference-class-forecasting (anchor on the base rate of comparable cases) or think-fermi-estimation (build the number from factors). Wrong number, use those; untrustworthy "sure," use this.
- Do not calibrate an interval around a lookupable fact. Where the answer can simply be checked, or no genuine uncertainty exists, calibrating its interval is theater.
- Do not present a one-shot bet-test as a full calibration. Without resolvable items only the bet-test half applies, and the bet device is the least-evidenced part of the protocol; say plainly that the scorecard is one-legged.
- Do not promise full debiasing. The controlled record shows partial correction with a stubborn residue. Promise tighter honesty about uncertainty, not calibrated certainty.
- Do not confuse it with content moves. It never asks what information is missing (that is the consider-the-unknowns move) and never generates a second estimate to average (that is think-dialectical-bootstrapping). It is content-blind: it only asks whether the stated number means what it claims.