Validate Evaluator
Calibrate an LLM judge against human judgment.
Overview
- Split human-labeled data into train (10-20%), dev (40-45%), test (40-45%)
- Run the judge on the dev set and measure its true positive rate (TPR) and true negative rate (TNR) against the human labels
- Iterate on the judge until TPR and TNR > 90% on dev set
- Run once on held-out test set for final TPR/TNR
- Apply bias correction formula to production data
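The measurement and correction steps above can be sketched as follows. This is a minimal illustration, not the skill's own code: it assumes human labels and judge verdicts are parallel lists of booleans (True = Pass), and it uses the standard Rogan-Gladen estimator for the bias correction, which the skill's formula is assumed to match.

```python
def tpr_tnr(human, judge):
    """True positive rate and true negative rate of the judge vs. human labels."""
    tp = sum(h and j for h, j in zip(human, judge))
    tn = sum(not h and not j for h, j in zip(human, judge))
    pos = sum(human)
    neg = len(human) - pos
    return tp / pos, tn / neg

def corrected_pass_rate(observed_rate, tpr, tnr):
    """Rogan-Gladen correction: estimate the true pass rate on production
    data from the judge's observed pass rate and its calibrated TPR/TNR."""
    return (observed_rate + tnr - 1) / (tpr + tnr - 1)

# Illustrative dev set: judge agrees on 9/10 passes and 8/10 fails.
human = [True] * 10 + [False] * 10
judge = [True] * 9 + [False] + [False] * 8 + [True] * 2
tpr, tnr = tpr_tnr(human, judge)  # 0.9, 0.8
# If the judge reports a 62% pass rate in production, the corrected
# estimate of the true pass rate is (0.62 + 0.8 - 1) / (0.9 + 0.8 - 1) = 0.6.
estimate = corrected_pass_rate(0.62, tpr, tnr)
```

Note the denominator: as TPR + TNR approaches 1, the judge carries no signal and the correction blows up, which is one reason the skill insists on > 90% for both rates before trusting production numbers.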
Prerequisites
- A built LLM judge prompt (from write-judge-prompt)
- Human-labeled data: ~100 traces with binary Pass/Fail labels per failure mode
- Aim for ~50 Pass and ~50 Fail (balanced, even if real distribution is skewed)
- Labels must come from a domain expert, not outsourced annotators
- Candidate few-shot examples from your labeled data
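The labeled data above feeds the train/dev/test split described in the Overview. A minimal sketch of a stratified split that keeps each subset roughly balanced between Pass and Fail; the function name, trace representation, and exact fractions are illustrative assumptions, not part of the skill:

```python
import random

def split_labeled(traces, train_frac=0.15, dev_frac=0.425, seed=0):
    """Split (trace, label) pairs into train/dev/test, shuffling within
    each label group so every split stays roughly balanced."""
    rng = random.Random(seed)
    passes = [t for t in traces if t[1]]
    fails = [t for t in traces if not t[1]]
    train, dev, test = [], [], []
    for group in (passes, fails):
        rng.shuffle(group)
        n = len(group)
        a = int(n * train_frac)          # ~15% for few-shot examples
        b = a + int(n * dev_frac)        # ~42.5% for iteration
        train += group[:a]
        dev += group[a:b]
        test += group[b:]                # remainder held out for the final run
    return train, dev, test

traces = [(f"trace-{i}", i % 2 == 0) for i in range(100)]  # 50 Pass / 50 Fail
train, dev, test = split_labeled(traces)
```

Holding the test set out until the judge clears the dev-set bar is what keeps the final TPR/TNR numbers honest.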
Related skills