Metric Validation Harness

Point this harness at a candidate metric and a corpus, and it runs experiments that try to falsify each property a trustworthy, optimizable metric must have. It is the empirical companion to deterministic-metric-design: that skill tells you to prove monotonicity, invariance, determinism, and construct validity; this skill runs the experiments and reports PASS/FAIL for each one, with every result mapped to the design-skill category it checks.

Read-only. It computes and reports; it never modifies your metric, the corpus, or any external state. Safe to run unsupervised.
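
As a rough illustration of what "trying to falsify a property" can look like, here is a minimal Python sketch of a determinism check and an invariance check. It is not the harness's actual implementation: the function names, the `Metric` signature (text in, float out), the two toy metrics, and the blank-line transform are all assumptions made for this example.

```python
from typing import Callable, Sequence

# Hypothetical signature: a metric maps one corpus item (source text) to a score.
Metric = Callable[[str], float]


def check_determinism(metric: Metric, corpus: Sequence[str], runs: int = 3) -> str:
    """FAIL if any item receives different scores across repeated runs."""
    for item in corpus:
        scores = {metric(item) for _ in range(runs)}
        if len(scores) > 1:
            return "FAIL"
    return "PASS"


def check_invariance(
    metric: Metric, corpus: Sequence[str], transform: Callable[[str], str]
) -> str:
    """FAIL if a meaning-preserving transform changes any item's score."""
    for item in corpus:
        if metric(transform(item)) != metric(item):
            return "FAIL"
    return "PASS"


def pad_with_blank_lines(text: str) -> str:
    """A cosmetic edit: append blank lines without touching the logic."""
    return text + "\n\n"


# Two toy metrics, invented for this example only.
def line_count(text: str) -> float:
    return float(len(text.splitlines()))


def branch_count(text: str) -> float:
    return float(sum(line.strip().startswith("if ") for line in text.splitlines()))


corpus = ["x = 1\nif x:\n    y = 2", "a = 0"]

report = {
    "determinism / branch_count": check_determinism(branch_count, corpus),
    "invariance / line_count": check_invariance(line_count, corpus, pad_with_blank_lines),
    "invariance / branch_count": check_invariance(branch_count, corpus, pad_with_blank_lines),
}
for check, verdict in report.items():
    print(f"{check}: {verdict}")
```

Both checks only call the metric and compare its outputs, so they stay read-only. The exact equality comparison is a simplification; a metric that returns floating-point aggregates may need a tolerance instead.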

When to Apply

Someone proposes, reviews, tunes, or ships a metric / score / index and you need evidence it is sound
A score "feels off": you suspect it tracks lines of code (LOC), jumps between runs, or saturates
You are about to let an agent optimize a metric and need to know it can't be gamed by cosmetic edits
You built a candidate per deterministic-metric-design and want to empirically confirm the properties you argued for
You are choosing between two metrics and need to know which actually predicts the outcome (and beats a trivial baseline)
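
The last case above, checking whether a metric predicts the outcome better than a trivial baseline, could be sketched as a rank-correlation comparison. This is an illustrative sketch only: the use of Spearman correlation, the `margin` threshold, the function names, and the toy numbers are assumptions for this example, not values or methods taken from the harness.

```python
from statistics import mean
from typing import Sequence


def average_ranks(values: Sequence[float]) -> list[float]:
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Spearman rank correlation: Pearson correlation computed on ranks."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant series carries no ranking information
    return cov / (var_x * var_y) ** 0.5


def beats_baseline(
    candidate: Sequence[float],
    baseline: Sequence[float],
    outcome: Sequence[float],
    margin: float = 0.05,  # assumed threshold, chosen arbitrarily here
) -> str:
    """PASS if the candidate's association with the outcome exceeds the
    baseline's by more than `margin` (strength only, ignoring direction)."""
    gain = abs(spearman(candidate, outcome)) - abs(spearman(baseline, outcome))
    return "PASS" if gain > margin else "FAIL"


# Invented toy data: one value per corpus item.
outcome = [0, 1, 3, 2, 5]        # e.g. defects later observed per file
candidate = [1, 2, 4, 3, 6]      # scores from the metric under test
loc = [50, 10, 40, 20, 30]       # trivial baseline: lines of code

print(beats_baseline(candidate, loc, outcome))
print(beats_baseline(loc, loc, outcome))
```

A metric that cannot clear the baseline by some margin is adding little beyond what file size already tells you. On a real corpus, a significance test or confidence interval would be a more defensible criterion than a fixed margin.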

Workflow Overview