Benchmark Agents — Advanced AI Systems

Launch real Claude Code sessions with the plugin installed, verify skill injection, monitor PostToolUse validation catches, and produce a coverage report. This skill covers the full eval loop: setup → launch → monitor → verify → fix → release → repeat.

How Evals Work (The Only Correct Method)

Evals are run by you, in this conversation, not by scripts. The process is:

1. You create directories and install the plugin via Bash tool calls
2. You spawn WezTerm panes with wezterm cli spawn; each pane runs an independent Claude Code interactive session
3. You wait, then check debug logs and claim dirs to see what the plugin injected
4. You inspect the generated source code for correctness
5. You read conversation logs to find what the user had to correct
6. You update skills/hooks, run /release, and spawn more evals
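
As a rough illustration of the first two steps, the commands issued through Bash tool calls might look like the sketch below. These are individual commands run from the conversation, not a standalone eval script. The EVAL_ROOT location, the SLUG name, and the RUN_SPAWN switch are hypothetical and exist only for this example. Plugin installation is left out because the exact command depends on your setup. The sketch prints the spawn command by default and only opens a pane when RUN_SPAWN=1.

```shell
# Hypothetical names for illustration; adjust to your own layout.
EVAL_ROOT="${EVAL_ROOT:-$(mktemp -d)}"   # assumed scratch location for evals
SLUG="example-eval"                      # hypothetical eval name
EVAL_DIR="$EVAL_ROOT/$SLUG"

# Step 1: create the project directory for this eval.
mkdir -p "$EVAL_DIR"

# Plugin installation would go here; it is omitted because the exact
# command is specific to your environment.

# Step 2: spawn a WezTerm pane running an interactive Claude Code session.
# `wezterm cli spawn --cwd DIR -- PROG` opens a new pane running PROG in DIR
# and prints the id of the new pane.
if [ "${RUN_SPAWN:-0}" = "1" ]; then
  PANE_ID=$(wezterm cli spawn --cwd "$EVAL_DIR" -- claude)
  echo "spawned pane $PANE_ID in $EVAL_DIR"
else
  # Default: show the command without running it.
  echo "would run: wezterm cli spawn --cwd $EVAL_DIR -- claude"
fi
```

Running claude with no arguments starts the interactive session, which is the mode in which the plugin hooks fire. Keeping the pane id around makes it possible to address that pane later when monitoring.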

Never use claude --print, eval scripts, or Bun.spawn(["claude", ...]). These do not work because:

- Plugin hooks (PreToolUse, PostToolUse, UserPromptSubmit) only fire during interactive tool-calling sessions
- --print mode generates text without executing tools: no files are created, no deps installed, no dev servers started
- No session_id means dedup, profiler, and claim files don't work
