Calibration Probe — predict the regime before you run the methodology

A 30-second synthetic test that classifies a candidate LLM's number-picking behavior into one of five regimes, telling the practitioner whether the evidence-scoring methodology will help, hurt, or make no difference before they spend hours running a full pipeline.

This skill is a preflight diagnostic for the seven-principle evidence-scoring methodology from "Don't Let the LLM Pick a Number". It is the empirical answer to the question reviewers always ask: "Does this methodology always help?" The answer is no, but the probe tells you whether it will help on your model.

When to use this

You're about to deploy an LLM-as-judge and want a cheap diagnostic first.
You've never tested whether your specific model is reliable at scoring.
You're comparing several candidate models and want to pick the most calibrated one.
A practitioner asks "should I run the full pipeline or use a lighter touch?" — the probe answers that empirically, not by gut feel.
You ran the methodology and got worse results than naive scoring; the probe diagnoses why.

If the user wants to actually score something, use evidence-scoring, what-works-feedback-judge, or hackathon-judge. This skill answers the meta question — should you bother — not the scoring question itself.

The five regimes

Every LLM, on a given task, falls into one of these number-picking shapes:
