Metric Validation Harness
Point this harness at a candidate metric and a corpus, and it runs experiments that try to falsify each property a trustworthy, optimizable metric must have. It is the empirical companion to deterministic-metric-design: that skill tells you to prove monotonicity, invariance, determinism, and construct validity; this skill runs the experiment and reports PASS/FAIL, each result mapped to the design-skill category it checks.
Read-only. It computes and reports; it never modifies your metric, the corpus, or any external state. Safe to run unsupervised.
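As a rough illustration of what "trying to falsify a property" can look like, here is a minimal sketch of two such checks, determinism and invariance under a cosmetic edit. Everything in it is an assumption made for illustration: the function names (`check_determinism`, `check_invariance`), the two toy metrics, the blank-line transform, and the tiny corpus are hypothetical and are not the harness's actual API or checks.

```python
from typing import Callable, Dict, Sequence

Metric = Callable[[str], float]


def check_determinism(metric: Metric, corpus: Sequence[str], runs: int = 3) -> str:
    """PASS if repeated calls on the same input always return the same score."""
    for item in corpus:
        scores = {metric(item) for _ in range(runs)}
        if len(scores) != 1:
            return "FAIL"
    return "PASS"


def check_invariance(metric: Metric, corpus: Sequence[str],
                     transform: Callable[[str], str]) -> str:
    """PASS if a meaning-preserving transform leaves every score unchanged."""
    for item in corpus:
        if metric(item) != metric(transform(item)):
            return "FAIL"
    return "PASS"


def add_blank_lines(text: str) -> str:
    """Hypothetical cosmetic edit: double every newline."""
    return text.replace("\n", "\n\n")


def line_count(text: str) -> float:
    """Toy metric that simply tracks LOC (expected to be gameable)."""
    return float(len(text.splitlines()))


def distinct_tokens(text: str) -> float:
    """Toy metric that ignores whitespace layout."""
    return float(len(set(text.split())))


# Made-up two-item corpus, for illustration only.
corpus = ["a = 1\nb = 2\n", "x = a + b\nprint(x)\n"]

report: Dict[str, Dict[str, str]] = {}
for name, metric in [("line_count", line_count), ("distinct_tokens", distinct_tokens)]:
    # Both checks only call the metric and compare results; nothing is modified.
    report[name] = {
        "determinism": check_determinism(metric, corpus),
        "invariance": check_invariance(metric, corpus, add_blank_lines),
    }

for name, results in report.items():
    print(name, results)
```

In this sketch the LOC-style metric fails invariance because inserting blank lines changes its score, while the token-based metric passes both checks. A real run would use many transforms and a much larger corpus.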
When to Apply
- Someone proposes, reviews, tunes, or ships a metric / score / index and you need evidence it is sound
- A score "feels off" — you suspect it tracks LOC, jumps between runs, or saturates
- You are about to let an agent optimize a metric and need to know it can't be gamed by cosmetic edits
- You built a candidate per deterministic-metric-design and want to empirically confirm the properties you argued for
- You are choosing between two metrics and need to know which actually predicts the outcome (and beats a trivial baseline)
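The last scenario, checking whether a candidate predicts the outcome better than a trivial baseline such as LOC, could be sketched as a rank-correlation comparison. This is only one plausible way to run such a check: the helper names, the `margin` threshold, and the numbers below are assumptions invented for illustration, not the harness's method or real measurements.

```python
from typing import List, Sequence


def ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; returns 0.0 if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / (vx * vy) ** 0.5


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def beats_baseline(candidate: Sequence[float], baseline: Sequence[float],
                   outcome: Sequence[float], margin: float = 0.05) -> str:
    """PASS if the candidate tracks the outcome more strongly than the baseline.

    The margin is an arbitrary illustrative threshold.
    """
    cand = abs(spearman(candidate, outcome))
    base = abs(spearman(baseline, outcome))
    return "PASS" if cand > base + margin else "FAIL"


# Made-up numbers for five corpus items, for illustration only.
outcome = [1.0, 2.0, 3.0, 4.0, 5.0]
candidate = [10.0, 20.0, 30.0, 40.0, 50.0]
loc_baseline = [5.0, 1.0, 4.0, 2.0, 3.0]

print("candidate vs LOC baseline:", beats_baseline(candidate, loc_baseline, outcome))
print("LOC baseline vs itself:", beats_baseline(loc_baseline, loc_baseline, outcome))
```

A candidate that cannot clear the baseline by some margin adds nothing over the trivial metric, which is the signal this kind of check is meant to surface. With only a handful of items the correlation is noisy, so a real comparison would need a larger corpus and some measure of uncertainty.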