review-harness
Review a Kaggle-Environments LLM Harness
This skill audits an existing harness for bugs that could plausibly affect gameplay. It complements create-harness (which builds harnesses).
Mindset
A good review finds both kinds of bugs:
- Known bugs: the patterns in the anti-pattern catalogue at the bottom of this document. Catching these is cheap (grep, then verify), high-confidence, and protects against regressions of issues we've already paid for once. Always run the catalogue checks. Skipping them because they feel mechanical is how harnesses ship with bugs we already knew how to find.
- Unknown bugs: the ones nobody has named yet. These are found by reading the game engine's source, stress-testing the parser with adversarial inputs, reading the prompt as a hostile LLM would, and pulling on threads in the replay data. Each one becomes a new catalogue entry (Step 7) so the next reviewer gets it for free.
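The "grep, then verify" loop for known bugs can be sketched as a small scanner. This is a minimal illustration, not the catalogue itself: the two check names and their regexes below are hypothetical placeholders, and a hit is only a candidate that still has to be verified by reading the surrounding code.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class CatalogueCheck:
    name: str     # hypothetical catalogue entry name
    pattern: str  # regex that flags a candidate, not a confirmed bug


# Placeholder entries for illustration only; the real catalogue defines its own.
CHECKS = [
    CatalogueCheck("bare-except-swallows-parse-error", r"except\s*:"),
    CatalogueCheck("first-match-wins-regex", r"re\.search\("),
]


def scan_source(source, checks=CHECKS):
    """Return (check name, line number, stripped line) for every candidate hit."""
    hits = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        for check in checks:
            if re.search(check.pattern, line):
                hits.append((check.name, lineno, line.strip()))
    return hits


if __name__ == "__main__":
    sample = "try:\n    move = re.search(r'\\d+', reply).group()\nexcept:\n    move = 0\n"
    for name, lineno, text in scan_source(sample):
        print(f"{name}: line {lineno}: {text}")
```

Keeping each check as data rather than code means a newly discovered bug can be added as one more entry without touching the scanner.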
Structure the review around three questions, applied with both lenses:
- Is the prompt telling the model the truth about the game? (Verify every concrete claim against the actual game engine; also walk the prompt-pattern row of the catalogue.)
- Does the parser robustly recover the model's intent across the messy responses real LLMs produce? (Stress-test it with adversarial inputs; also walk the parser-pattern row of the catalogue.)
- Does the replay data show the harness behaving the way the code says it should? (Run a generic intent-vs-action mismatch scan; also run targeted detectors for each catalogue pattern.)
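For the second question, a parser stress test might look like the sketch below. `parse_move` is a hypothetical stand-in for the harness's real parser, and the reply strings are invented examples of messy LLM output; in a real review, swap in the parser under audit and replies drawn from actual replays.

```python
import re


def parse_move(reply):
    """Hypothetical stand-in parser: returns the LAST 'move: N' / 'move = N' it finds."""
    matches = re.findall(r"move\s*[:=]\s*(\d+)", reply, flags=re.IGNORECASE)
    return int(matches[-1]) if matches else None


# (reply, intended move) pairs; the intended move is what a human reader would infer.
ADVERSARIAL_CASES = [
    ("move: 3", 3),
    ("I considered move: 1, but my final answer is move: 4", 4),
    ("MOVE = 2", 2),
    ("**Move:** 5", 5),           # markdown emphasis wrapped around the key
    ("My move is column 6.", 6),  # no delimiter at all
    ("", None),                   # empty reply
]


def stress_test(parser, cases):
    """Return (reply, intended, got) for every case the parser gets wrong."""
    failures = []
    for reply, intended in cases:
        try:
            got = parser(reply)
        except Exception as exc:  # a crash counts as a failure, not a test error
            got = f"raised {type(exc).__name__}"
        if got != intended:
            failures.append((reply, intended, got))
    return failures


if __name__ == "__main__":
    for reply, intended, got in stress_test(parse_move, ADVERSARIAL_CASES):
        print(f"reply={reply!r} intended={intended} got={got}")
```

With this stand-in parser, the markdown-wrapped and delimiter-free replies both come back as `None`. That is the kind of silent intent loss the stress test is meant to surface.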
Neither lens dominates. The catalogue tells you the cheapest, most reliable bugs to find first; the discovery techniques tell you what to do when the catalogue runs out.
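As an illustration of the replay question, a generic intent-vs-action scan can be as small as the sketch below. The record layout (`step`, `raw_response`, `action`) is an assumption made for this example and is not the actual replay schema; adapt the field access to wherever the harness logs the model's raw reply and the action it submitted.

```python
import re


def last_number(reply):
    """Deliberately simple intent guess: the last integer mentioned in the reply."""
    nums = re.findall(r"\d+", reply)
    return int(nums[-1]) if nums else None


def scan_mismatches(records, extract_intent):
    """Return one entry per step where the inferred intent differs from the action played."""
    mismatches = []
    for rec in records:
        intent = extract_intent(rec["raw_response"])
        # Skip steps where no intent can be inferred; they need manual review instead.
        if intent is not None and intent != rec["action"]:
            mismatches.append(
                {"step": rec["step"], "intent": intent, "action": rec["action"]}
            )
    return mismatches


# Invented records in the assumed layout, for illustration only.
SAMPLE_RECORDS = [
    {"step": 0, "raw_response": "I play 3", "action": 3},
    {"step": 1, "raw_response": "Options are 1 or 2. I choose 2", "action": 1},
    {"step": 2, "raw_response": "pass", "action": 0},
]

if __name__ == "__main__":
    for m in scan_mismatches(SAMPLE_RECORDS, last_number):
        print(m)
```

The intent extractor should be written independently of the harness parser. If the scan reuses the parser's own logic, it reproduces the parser's mistakes and reports no mismatch.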