calibration-probe
Calibration Probe — predict the regime before you run the methodology
A 30-second synthetic test that classifies a candidate LLM's number-picking behavior into one of five regimes. The result tells the practitioner whether the evidence-scoring methodology will help, hurt, or make no difference, before they spend hours running the full pipeline.
This skill is a preflight diagnostic for the seven-principle evidence-scoring methodology from Don't Let the LLM Pick a Number. It is the empirical answer to the question reviewers always ask: "does this methodology always help?" The answer is no, but the probe shows whether it will help on your model.
When to use this
- You're about to deploy an LLM-as-judge and want a cheap diagnostic first.
- You've never tested whether your specific model is reliable at scoring.
- You're comparing several candidate models and want to pick the most calibrated one.
- A practitioner asks "should I run the full pipeline or use a lighter touch?" — the probe answers that empirically, not by gut feel.
- You ran the methodology and got worse results than naive scoring; the probe diagnoses why.
If the user wants to actually score something, use evidence-scoring, what-works-feedback-judge, or hackathon-judge. This skill answers the meta question — should you bother — not the scoring question itself.
The five regimes
Every LLM, on a given task, falls into one of these number-picking shapes: