Instruction Evaluation

Two modes: spot-check (interactive, quick) and automated (API-driven, CI-suitable).

Spot-Check Mode (Interactive)

Run probes directly in a Claude Code session. Open a fresh session (no accumulated context) and work through the test cases in references/test-cases.md.

For each probe:

Ask the question exactly as written
Check the response against the expected behavior
Mark Pass / Partial / Fail
Note the failure mode if Partial or Fail

What to look for:

Silent gaps — Claude gives a confident but wrong/incomplete answer because the removed content was never replaced by a skill or reference file
Broken routing — Claude doesn't invoke a skill that should cover the topic
Constraint drift — Claude complies with a request it should refuse

instruction-eval

Instruction Evaluation

Spot-Check Mode (Interactive)