instruction-eval
Installation
SKILL.md
Instruction Evaluation
Two modes: spot-check (interactive, quick) and automated (API-driven, CI-suitable).
Spot-Check Mode (Interactive)
Run probes directly in a Claude Code session. Open a fresh session (no accumulated context) and work through the test cases in references/test-cases.md.
For each probe:
- Ask the question exactly as written
- Check the response against the expected behavior
- Mark Pass / Partial / Fail
- Note the failure mode if Partial or Fail
What to look for:
- Silent gaps — Claude gives a confident but wrong/incomplete answer because the removed content was never replaced by a skill or reference file
- Broken routing — Claude doesn't invoke a skill that should cover the topic
- Constraint drift — Claude complies with a request it should refuse