benchmark-sandbox
Benchmark Sandbox — Remote Eval via Vercel Sandboxes
Run benchmark scenarios inside Vercel Sandboxes — ephemeral Firecracker microVMs with node24. Each sandbox gets a fresh Claude Code + Vercel CLI + agent-browser install, the local vercel-plugin uploaded, and runs a 3-phase eval pipeline:
- Phase 1 (BUILD): Claude Code builds the app with `--dangerously-skip-permissions --debug`
- Phase 2 (VERIFY): A follow-up Claude Code session uses `agent-browser` to walk through user stories, fixing issues until all pass (20 min timeout)
- Phase 3 (DEPLOY): A third Claude Code session links to vercel-labs, runs `vercel deploy`, and fixes build errors (up to 3 retries). Deployed apps have deployment protection enabled by default.
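The control flow of the three phases can be sketched roughly as follows. This is an illustrative outline, not the contents of run-eval.ts: `PhaseRunner`, `runPipeline`, and `withTimeout` are hypothetical names, the runner is a stand-in for starting a Claude Code session in the sandbox, and stopping at the first failed phase is an assumption. "Up to 3 retries" is read here as one initial deploy attempt plus at most three more.

```typescript
// Hypothetical sketch of the 3-phase flow; none of these names come from run-eval.ts.
type Phase = "build" | "verify" | "deploy";

interface PhaseResult {
  phase: Phase;
  ok: boolean;
  attempts: number;
}

// Stand-in for "run one Claude Code session in the sandbox and report success".
type PhaseRunner = (phase: Phase, attempt: number) => Promise<boolean>;

const VERIFY_TIMEOUT_MS = 20 * 60 * 1000; // the 20 min VERIFY timeout
const MAX_DEPLOY_RETRIES = 3; // DEPLOY retries after the first attempt

// Reject if `work` does not settle within `ms` milliseconds.
function withTimeout<T>(work: Promise<T>, ms: number): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error(`timed out after ${ms} ms`)), ms);
  });
  return Promise.race([work, timeout]).finally(() => clearTimeout(timer));
}

async function runPipeline(
  run: PhaseRunner,
  verifyTimeoutMs: number = VERIFY_TIMEOUT_MS,
): Promise<PhaseResult[]> {
  const results: PhaseResult[] = [];

  // Phase 1: a single build session.
  const built = await run("build", 1);
  results.push({ phase: "build", ok: built, attempts: 1 });
  if (!built) return results; // assumption: later phases are skipped on failure

  // Phase 2: verification counts as failed if it exceeds the timeout.
  const verified = await withTimeout(run("verify", 1), verifyTimeoutMs).catch(() => false);
  results.push({ phase: "verify", ok: verified, attempts: 1 });
  if (!verified) return results;

  // Phase 3: first deploy attempt plus up to MAX_DEPLOY_RETRIES retries.
  let attempts = 0;
  let deployed = false;
  while (!deployed && attempts < 1 + MAX_DEPLOY_RETRIES) {
    attempts += 1;
    deployed = await run("deploy", attempts);
  }
  results.push({ phase: "deploy", ok: deployed, attempts });
  return results;
}
```

Passing the runner in as a function keeps the sequencing logic separate from whatever actually talks to the sandbox, so the retry and timeout behavior can be exercised with a stub.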
Skills are tracked across all 3 phases: each phase may trigger additional skill injections as new files or patterns are created. After each phase, a structured scoring step on the haiku model (`claude -p --json-schema --model haiku`) evaluates the results and returns them as structured JSON.
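To make the scoring step more concrete, here is a minimal sketch of what a per-phase score and its validation might look like. The actual schema passed to `--json-schema` is not shown in this document, so `PhaseScore`, its fields, `phaseScoreSchema`, and `parsePhaseScore` are all hypothetical; the sketch only illustrates checking the scorer's JSON output before trusting it.

```typescript
// Hypothetical score shape; the real schema used by the eval is not shown here.
interface PhaseScore {
  phase: "build" | "verify" | "deploy";
  passed: boolean;
  notes: string;
}

// A JSON Schema mirroring PhaseScore, the kind of object a schema-constrained
// scoring step could be given (illustrative only).
const phaseScoreSchema = {
  type: "object",
  properties: {
    phase: { enum: ["build", "verify", "deploy"] },
    passed: { type: "boolean" },
    notes: { type: "string" },
  },
  required: ["phase", "passed", "notes"],
  additionalProperties: false,
} as const;

// Parse the scorer's raw output and reject anything that does not match PhaseScore.
function parsePhaseScore(raw: string): PhaseScore {
  const value: unknown = JSON.parse(raw);
  if (typeof value !== "object" || value === null) {
    throw new Error("score must be a JSON object");
  }
  const { phase, passed, notes } = value as Record<string, unknown>;
  if (phase !== "build" && phase !== "verify" && phase !== "deploy") {
    throw new Error("score.phase must be build, verify, or deploy");
  }
  if (typeof passed !== "boolean") {
    throw new Error("score.passed must be a boolean");
  }
  if (typeof notes !== "string") {
    throw new Error("score.notes must be a string");
  }
  return { phase, passed, notes };
}
```

Validating on the receiving side is a defensive choice: even with a schema-constrained model call, a truncated or malformed response then fails loudly instead of being recorded as a score.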
Proven Working Script
Use run-eval.ts — the proven eval runner:
# Run default scenarios with full 3-phase pipeline
bun run .claude/skills/benchmark-sandbox/run-eval.ts