experiment-audit


Experiment Audit: Cross-Model Integrity Verification

🔒 Do not wrap this skill in /loop, /schedule, or CronCreate. It is verdict-bearing — it judges experiment integrity. Re-running that verdict on a timer adds no new signal, and a loop that accepts its own output to decide when to stop crosses into self-acquittal (acceptance-gate.md). Schedule the external wait that precedes it — experiments done → then audit once. See shared-references/external-cadence.md.

Audit experiment integrity for: $ARGUMENTS

Why This Exists

LLM agents can produce fraudulent experimental results through:

  1. Fake ground truth: building a synthetic "reference" from the model's own outputs, then reporting the resulting high agreement as performance
  2. Score normalization: dividing a metric by the model's own maximum so that scores land at 0.99+
  3. Phantom results: citing numbers from files that don't exist or from functions that were never called
  4. Insufficient scope: reporting a 2-scene pilot as a "comprehensive evaluation"
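
Two of the failure modes above are easy to illustrate mechanically. The following is a minimal sketch, not the skill's actual audit logic: the function names `self_normalized` and `missing_result_files` and the example paths are hypothetical, chosen only for illustration.

```python
from pathlib import Path


def self_normalized(scores):
    """Failure mode 2: divide every score by the run's own maximum.

    The best score always becomes exactly 1.0, whatever its raw value was.
    """
    top = max(scores)
    return [s / top for s in scores]


def missing_result_files(claimed_paths, root="."):
    """Failure mode 3: list claimed result files that do not exist under root."""
    base = Path(root)
    return [p for p in claimed_paths if not (base / p).is_file()]


if __name__ == "__main__":
    raw = [0.31, 0.28, 0.30]  # weak raw scores (illustrative values)
    print("raw:", raw)
    print("self-normalized:", [round(s, 3) for s in self_normalized(raw)])

    # Hypothetical paths that a report might cite as the source of its numbers.
    claimed = ["results/eval_full.json", "results/ablation.csv"]
    print("claimed but missing:", missing_result_files(claimed))
```

Self-normalization turns a raw best score of 0.31 into 1.0, which is why an audit would compare reported numbers against the raw metric rather than a ratio the model defines for itself. The file check is deliberately simple: a number cited from a path that is not on disk cannot be verified at all.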
First seen: Apr 12, 2026
experiment-audit — wanshuiyin/auto-claude-code-research-in-sleep