validate-evaluator


Validate Evaluator

Calibrate an LLM judge against human judgment.

Overview

  1. Split human-labeled data into train (10-20%), dev (40-45%), test (40-45%)
  2. Run the judge on the dev set and measure its true positive rate (TPR) and true negative rate (TNR)
  3. Iterate on the judge until TPR and TNR > 90% on dev set
  4. Run once on held-out test set for final TPR/TNR
  5. Apply the bias-correction formula to judge results on production data
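Steps 2 and 5 can be sketched in a few lines of Python. This is a minimal illustration, not part of the skill itself: the function names are invented, and the correction shown is the standard Rogan-Gladen estimator, which recovers the true pass rate from the judge's observed pass rate once TPR and TNR are known.

```python
# Hypothetical sketch of measuring judge agreement and correcting the
# observed pass rate. Names are illustrative, not from the skill.

def tpr_tnr(human, judge):
    """TPR and TNR of the judge against human labels.

    human, judge: parallel lists of booleans (True = Pass).
    """
    tp = sum(h and j for h, j in zip(human, judge))
    tn = sum(not h and not j for h, j in zip(human, judge))
    pos = sum(human)
    neg = len(human) - pos
    return tp / pos, tn / neg

def corrected_pass_rate(observed, tpr, tnr):
    """Rogan-Gladen correction: estimate the true pass rate from the
    judge's observed pass rate and its measured TPR/TNR."""
    return (observed + tnr - 1) / (tpr + tnr - 1)

# Example: judge reports 70% pass in production with TPR=0.92, TNR=0.95
# corrected_pass_rate(0.70, 0.92, 0.95) -> 0.65 / 0.87, about 0.747
```

A judge that slightly over-fails (TNR near 1, TPR below 1) will understate the true pass rate; the correction pushes the estimate back up accordingly.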

Prerequisites

  • A built LLM judge prompt (from write-judge-prompt)
  • Human-labeled data: ~100 traces with binary Pass/Fail labels per failure mode
    • Aim for ~50 Pass and ~50 Fail (balanced, even if real distribution is skewed)
    • Labels must come from a domain expert, not outsourced annotators
  • Candidate few-shot examples from your labeled data
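Given ~100 balanced labels, the split in step 1 should be stratified so each split preserves the Pass/Fail balance. A minimal sketch, assuming `(trace, label)` pairs and picking 15% / 42.5% / 42.5% from within the skill's stated ranges (the function name and exact proportions are illustrative):

```python
import random

def split_labeled_traces(traces, seed=0):
    """Stratified train/dev/test split of human-labeled traces.

    traces: list of (trace, label) pairs, label in {"Pass", "Fail"}.
    Shuffles within each label group so every split keeps the balance.
    """
    rng = random.Random(seed)
    splits = {"train": [], "dev": [], "test": []}
    for label in ("Pass", "Fail"):
        group = [t for t in traces if t[1] == label]
        rng.shuffle(group)
        n_train = round(len(group) * 0.15)   # few-shot example pool
        n_dev = round(len(group) * 0.425)    # iteration set
        splits["train"] += group[:n_train]
        splits["dev"] += group[n_train:n_train + n_dev]
        splits["test"] += group[n_train + n_dev:]  # touched once, at the end
    return splits
```

The train slice is where few-shot examples for the judge prompt come from; keeping it small leaves most labels for the dev set you iterate against and the test set you run exactly once.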