experiment-audit
Experiment Audit: Cross-Model Integrity Verification
🔒 Do not wrap this skill in `/loop`, `/schedule`, or `CronCreate`. It is verdict-bearing: it judges experiment integrity. Re-running that verdict on a timer adds no new signal, and a loop that relies on its own verdict to decide when to stop crosses into self-acquittal (see `acceptance-gate.md`). Schedule the external wait that precedes it instead: wait until the experiments are done, then audit once. See `shared-references/external-cadence.md`.
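The ordering described above (external wait first, then a single audit) can be sketched as follows. This is a minimal illustration under assumptions: the `DONE` marker file, `wait_for_marker`, and `run_after_experiments` are hypothetical names, not part of this skill or of any scheduler API. The point is that only the external condition is polled; the audit is called exactly once and its verdict never feeds back into the wait.

```python
import time
from pathlib import Path
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def wait_for_marker(marker: Path, poll_seconds: float = 60.0,
                    timeout_seconds: Optional[float] = None) -> bool:
    """Poll an external condition (hypothetical marker file written by the
    experiment runner). Returns True once it exists, False on timeout."""
    start = time.monotonic()
    while not marker.exists():
        if timeout_seconds is not None and time.monotonic() - start >= timeout_seconds:
            return False
        time.sleep(poll_seconds)
    return True


def run_after_experiments(marker: Path, audit_fn: Callable[[], T],
                          poll_seconds: float = 60.0,
                          timeout_seconds: Optional[float] = None) -> Optional[T]:
    """Wait for the experiments, then invoke the audit exactly once."""
    if not wait_for_marker(marker, poll_seconds, timeout_seconds):
        return None  # experiments not finished: no audit, no verdict
    # Single call; the verdict is returned, never used to decide whether to re-run.
    return audit_fn()
```

Usage: `run_after_experiments(Path("runs/DONE"), my_audit)`, where `my_audit` stands in for whatever invokes this skill.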
Audit experiment integrity for: $ARGUMENTS
Why This Exists
LLM agents can produce fraudulent experimental results through:
- Fake ground truth: building a synthetic "reference" from the model's own outputs, then reporting high agreement with it as performance
- Score normalization: dividing a metric by the model's own maximum so scores come out at 0.99+
- Phantom results: citing numbers from files that do not exist or from functions that were never called
- Insufficient scope: presenting a 2-scene pilot as a "comprehensive evaluation"
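As an illustration only, each failure mode above maps to a cheap mechanical red flag. The sketch below is hypothetical: the report fields (`raw_scores`, `reported_scores`, `result_files`, and so on), the 0.98 agreement threshold, and the 10-scene minimum are assumptions, not part of this skill. A flag means "look closer", not a verdict; near-perfect agreement, for instance, can also come from a genuinely strong model.

```python
import math
from pathlib import Path

# Heuristic red-flag checks for the four failure modes listed above.
# All field names and thresholds are illustrative assumptions.


def looks_self_normalized(raw, reported, tol=1e-6):
    """Score normalization: reported values equal raw values divided by the run's own max."""
    if not raw or len(raw) != len(reported):
        return False
    peak = max(raw)
    if peak <= 0:
        return False
    return all(math.isclose(r / peak, s, abs_tol=tol) for r, s in zip(raw, reported))


def suspicious_agreement(predictions, reference, threshold=0.98):
    """Fake ground truth: near-perfect agreement hints the reference came from model outputs."""
    if not reference or len(predictions) != len(reference):
        return False
    matches = sum(p == r for p, r in zip(predictions, reference))
    return matches / len(reference) >= threshold


def missing_files(claimed_paths):
    """Phantom results: cited result files that do not exist on disk."""
    return [p for p in claimed_paths if not Path(p).is_file()]


def undersized_scope(num_scenes, claims_comprehensive, min_scenes=10):
    """Insufficient scope: a small pilot described as a comprehensive evaluation."""
    return bool(claims_comprehensive) and num_scenes < min_scenes


def audit(report):
    """Return red-flag labels for a hypothetical report dict."""
    flags = []
    if looks_self_normalized(report["raw_scores"], report["reported_scores"]):
        flags.append("score_normalization")
    if suspicious_agreement(report["predictions"], report["reference"]):
        flags.append("fake_ground_truth")
    if missing_files(report["result_files"]):
        flags.append("phantom_results")
    if undersized_scope(report["num_scenes"], report["claims_comprehensive"]):
        flags.append("insufficient_scope")
    return flags
```

The checks are deliberately independent so that one clean signal cannot mask another; a real audit would still need to trace each reported number back to the code and files that produced it.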