Benchmark Sandbox — Remote Eval via Vercel Sandboxes

Run benchmark scenarios inside Vercel Sandboxes — ephemeral Firecracker microVMs running the node24 runtime. Each sandbox gets a fresh install of Claude Code, the Vercel CLI, and agent-browser, has the local vercel-plugin uploaded, and runs a 3-phase eval pipeline:

  • Phase 1 (BUILD): Claude Code builds the app with --dangerously-skip-permissions --debug
  • Phase 2 (VERIFY): A follow-up Claude Code session uses agent-browser to walk through user stories, fixing issues until all pass (20 min timeout)
  • Phase 3 (DEPLOY): A third Claude Code session links to vercel-labs, runs vercel deploy, and fixes build errors (up to 3 retries). Deployed apps have deployment protection enabled by default.
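
The phase sequencing above can be sketched as follows. This is a minimal illustration, not the skill's actual runner: the `PhaseRunner` stand-ins replace real Claude Code sessions, and the names (`runPhase`, `runPipeline`, `PhaseResult`) are hypothetical. It shows the gating (VERIFY only after a successful BUILD, DEPLOY only after VERIFY) and DEPLOY's up-to-3-retries behavior.

```typescript
type PhaseResult = { phase: string; ok: boolean; attempts: number };

// Stand-in for launching a Claude Code session inside the sandbox.
type PhaseRunner = () => boolean;

function runPhase(name: string, run: PhaseRunner, maxRetries = 0): PhaseResult {
  let attempts = 0;
  let ok = false;
  // Initial attempt plus up to maxRetries retries on failure.
  while (attempts <= maxRetries && !ok) {
    attempts++;
    ok = run();
  }
  return { phase: name, ok, attempts };
}

function runPipeline(runners: {
  build: PhaseRunner;
  verify: PhaseRunner;
  deploy: PhaseRunner;
}): PhaseResult[] {
  const results: PhaseResult[] = [];
  results.push(runPhase("BUILD", runners.build));
  // VERIFY only makes sense against a built app; DEPLOY against a verified one.
  if (results[0].ok) results.push(runPhase("VERIFY", runners.verify));
  if (results[1]?.ok) results.push(runPhase("DEPLOY", runners.deploy, 3));
  return results;
}
```

With this shape, a deploy that fails twice and succeeds on the third try reports `attempts: 3` and `ok: true`, while a failed BUILD short-circuits the whole pipeline.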

Skills are tracked across all 3 phases — each phase may trigger additional skill injections as new files and patterns are created. After each phase, a structured scoring step on the haiku model (claude -p --json-schema --model haiku) evaluates the results and emits structured JSON.
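
A sketch of that scoring step, assuming only the flags the text names (`claude -p --json-schema --model haiku`). The schema shape and the helper names (`scoringArgs`, `parseScore`) are illustrative, not the skill's actual schema; spawning the CLI is left to the caller.

```typescript
// Hypothetical per-phase scoring schema (assumption, not the skill's real one).
const scoreSchema = {
  type: "object",
  properties: {
    pass: { type: "boolean" },
    score: { type: "number" },
    notes: { type: "string" },
  },
  required: ["pass", "score"],
};

// Build the argv for the scoring invocation described in the text.
function scoringArgs(prompt: string): string[] {
  return [
    "claude",
    "-p", prompt,
    "--model", "haiku",
    "--json-schema", JSON.stringify(scoreSchema),
  ];
}

// Parse the model's structured output, rejecting anything off-schema.
function parseScore(raw: string): { pass: boolean; score: number; notes?: string } {
  const parsed = JSON.parse(raw);
  if (typeof parsed.pass !== "boolean" || typeof parsed.score !== "number") {
    throw new Error("scoring output does not match schema");
  }
  return parsed;
}
```

Validating the parsed JSON against the schema on the way back in keeps a malformed model reply from silently corrupting the eval results.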

Proven Working Script

Use run-eval.ts — the proven eval runner:

# Run default scenarios with full 3-phase pipeline
bun run .claude/skills/benchmark-sandbox/run-eval.ts

# With dynamic scenarios from a JSON file (recommended — see "Dynamic Scenarios" below)
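
One possible shape for a dynamic-scenarios file, shown as typed TypeScript for clarity. The field names (`id`, `prompt`, `userStories`) are assumptions for this sketch, not the runner's actual schema — check run-eval.ts for the real format.

```typescript
// Illustrative scenario shape (field names are assumptions).
interface Scenario {
  id: string;
  prompt: string;        // what Claude Code is asked to build in Phase 1
  userStories: string[]; // walked through with agent-browser in Phase 2
}

const scenarios: Scenario[] = [
  {
    id: "todo-app",
    prompt: "Build a todo app with add, complete, and delete.",
    userStories: [
      "A user can add a todo and see it in the list",
      "A user can mark a todo complete",
    ],
  },
];

// The runner would read this shape from the JSON file passed on the CLI.
const asJson = JSON.stringify(scenarios, null, 2);
```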