benchmark-agents
Benchmark Agents — Advanced AI Systems
Launch real Claude Code sessions with the plugin installed, verify skill injection, monitor PostToolUse validation catches, and produce a coverage report. This skill covers the full eval loop: setup → launch → monitor → verify → fix → release → repeat.
How Evals Work (The Only Correct Method)
Evals are run by you, in this conversation, not by scripts. The process is:
- You create directories and install the plugin via Bash tool calls
- You spawn WezTerm panes with `wezterm cli spawn`; each pane runs an independent Claude Code interactive session
- You wait, then check debug logs and claim dirs to see what the plugin injected
- You inspect the generated source code for correctness
- You read conversation logs to find what the user had to correct
- You update skills/hooks, run `/release`, and spawn more evals
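The setup and launch steps above can be sketched as commands to issue through Bash tool calls. This is a minimal sketch under stated assumptions, not the skill's actual layout: `EVAL_ROOT`, the `eval-01`/`eval-02` slugs, the `spawn_eval` helper, and the `DRY_RUN` switch are all hypothetical names introduced for illustration, and the plugin install step is left as a comment because its command is not given here. It also assumes the `claude` binary is on `PATH` and starts an interactive session when run without arguments.

```shell
# Hypothetical scratch root: one directory per eval scenario.
EVAL_ROOT="${EVAL_ROOT:-$(mktemp -d)}"

# Create the eval directory, then print or run the spawn command for its pane.
# DRY_RUN defaults to 1 (print only), so the sketch is safe to read through
# on a machine without WezTerm; set DRY_RUN=0 to actually open panes.
spawn_eval() {
  slug="$1"
  dir="$EVAL_ROOT/$slug"
  mkdir -p "$dir"
  # (Install the plugin into "$dir" here; the install command is not shown.)
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "wezterm cli spawn --cwd $dir -- claude"
  else
    # wezterm prints the new pane's id, which is useful for later inspection.
    wezterm cli spawn --cwd "$dir" -- claude
  fi
}

for slug in eval-01 eval-02; do
  spawn_eval "$slug"
done
```

Keeping one directory per pane means each session's generated source can be inspected separately afterwards.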
Never use `claude --print`, eval scripts, or `Bun.spawn(["claude", ...])`. These do not work because:
- Plugin hooks (PreToolUse, PostToolUse, UserPromptSubmit) only fire during interactive tool-calling sessions
- `--print` mode generates text without executing tools: no files are created, no deps installed, no dev servers started
- No `session_id` means dedup, profiler, and claim files don't work
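One way to keep this rule from being broken by accident is a small guard that rejects non-interactive launch commands before they are run. The sketch below is illustrative only: `is_interactive_launch` is a hypothetical helper, not part of the plugin, and it assumes `-p` is the short form of `--print`.

```shell
# Return 0 if the given command line looks like an interactive launch,
# 1 if it contains the print-mode flag (--print, or -p as its assumed short form).
is_interactive_launch() {
  case " $* " in
    *" --print "*|*" -p "*) return 1 ;;
    *) return 0 ;;
  esac
}

if is_interactive_launch claude --print "build a todo app"; then
  echo "ok to launch"
else
  echo "refusing: print mode does not execute tools or fire hooks"
fi
```

The check only matches the flag as a separate word, so it is a sanity check rather than a full argument parser.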