benchmark-agents
Benchmark Agents — Advanced AI Systems
Launch real Claude Code sessions with the plugin installed, verify skill injection, monitor PostToolUse validation catches, and produce a coverage report. This skill covers the full eval loop: setup → launch → monitor → verify → fix → release → repeat.
How Evals Work (The Only Correct Method)
Evals are run by you, in this conversation, not by scripts. The process is:
- You create directories and install the plugin via Bash tool calls
- You spawn WezTerm panes with `wezterm cli spawn`; each pane runs an independent Claude Code interactive session
- You wait, then check debug logs and claim dirs to see what the plugin injected
- You inspect the generated source code for correctness
- You read conversation logs to find what the user had to correct
- You update skills/hooks, run `/release`, and spawn more evals
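The setup and launch steps above can be sketched as the body of a single Bash tool call. This is a minimal illustration, not the plugin's own tooling: the function name `spawn_eval`, the `WEZTERM` override variable, and the example directory and prompt are all hypothetical. The plugin install command is not given here, so it is left as a comment. The sketch assumes `wezterm cli spawn` accepts `--cwd` followed by `--` and the program to run, and that `claude` given a positional prompt opens an interactive session.

```shell
# Hypothetical helper for one Bash tool call; it is not an eval script and
# does not automate the loop. WEZTERM can be overridden for a dry run.
WEZTERM="${WEZTERM:-wezterm}"

spawn_eval() {
  dir="$1"     # project directory for this eval (illustrative)
  prompt="$2"  # first message sent to the interactive session

  mkdir -p "$dir"
  # Install the plugin into "$dir" here; the exact command is not
  # specified in this document, so it is deliberately omitted.

  # Everything after `--` is the program run in the new pane.
  # No --print flag: the session stays interactive, so hooks can fire.
  "$WEZTERM" cli spawn --cwd "$dir" -- claude "$prompt"
}

# Example call (assumed path and prompt):
#   spawn_eval "$HOME/evals/todo-app" "Build a todo app with Next.js"
```

Setting `WEZTERM=echo` before calling `spawn_eval` prints the arguments instead of opening a pane, which is a cheap way to review the command before running it for real.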
Never use `claude --print`, eval scripts, or `Bun.spawn(["claude", ...])`. These do not work because:
- Plugin hooks (PreToolUse, PostToolUse, UserPromptSubmit) only fire during interactive tool-calling sessions
- `--print` mode generates text without executing tools: no files are created, no deps installed, no dev servers started
- No `session_id` means dedup, profiler, and claim files don't work