benchmark-sandbox
Benchmark Sandbox — Remote Eval via Vercel Sandboxes
Run benchmark scenarios inside Vercel Sandboxes: ephemeral Firecracker microVMs with node24. Each sandbox gets a fresh install of Claude Code, the Vercel CLI, and agent-browser, has the local vercel-plugin uploaded to it, and runs a 3-phase eval pipeline:
- Phase 1 (BUILD): Claude Code builds the app with --dangerously-skip-permissions --debug
- Phase 2 (VERIFY): A follow-up Claude Code session uses agent-browser to walk through user stories, fixing issues until all pass (20 min timeout)
- Phase 3 (DEPLOY): A third Claude Code session links to vercel-labs, runs vercel deploy, and fixes build errors (up to 3 retries). Deployed apps have deployment protection enabled by default.
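The phase ordering above can be sketched roughly as follows. This is a minimal illustration, not the code in run-eval.ts: the names PhaseRunner, runPipeline, and withTimeout are hypothetical, the runner is injected so no sandbox or CLI call is assumed, and two behaviors are guesses: "up to 3 retries" is read as one initial attempt plus three retries, and the sketch stops at the first failed phase.

```typescript
// Hypothetical types and names; none of these come from run-eval.ts.
type PhaseName = "BUILD" | "VERIFY" | "DEPLOY";

interface PhaseResult {
  phase: PhaseName;
  ok: boolean;
  attempts: number;
}

// A runner executes one Claude Code session for a phase and reports success.
type PhaseRunner = (phase: PhaseName, attempt: number) => Promise<boolean>;

const VERIFY_TIMEOUT_MS = 20 * 60 * 1000; // the 20 min VERIFY timeout
const MAX_DEPLOY_ATTEMPTS = 1 + 3; // assumption: 1 initial attempt + 3 retries

// Resolve with `fallback` if `work` does not settle within `ms` milliseconds.
function withTimeout<T>(work: Promise<T>, ms: number, fallback: T): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<T>((resolve) => {
    timer = setTimeout(() => resolve(fallback), ms);
  });
  return Promise.race([work, timeout]).finally(() => {
    if (timer !== undefined) clearTimeout(timer);
  });
}

async function runPipeline(
  run: PhaseRunner,
  verifyTimeoutMs: number = VERIFY_TIMEOUT_MS,
): Promise<PhaseResult[]> {
  const results: PhaseResult[] = [];

  // Phase 1: a single build session.
  const built = await run("BUILD", 1);
  results.push({ phase: "BUILD", ok: built, attempts: 1 });
  if (!built) return results; // assumption: stop at the first failed phase

  // Phase 2: verification, treated as failed if it exceeds the timeout.
  const verified = await withTimeout(run("VERIFY", 1), verifyTimeoutMs, false);
  results.push({ phase: "VERIFY", ok: verified, attempts: 1 });
  if (!verified) return results;

  // Phase 3: deploy, retrying until success or the attempt budget is spent.
  let attempts = 0;
  let deployed = false;
  while (!deployed && attempts < MAX_DEPLOY_ATTEMPTS) {
    attempts += 1;
    deployed = await run("DEPLOY", attempts);
  }
  results.push({ phase: "DEPLOY", ok: deployed, attempts });
  return results;
}
```

Injecting the runner keeps the sequencing logic testable without a sandbox; a real runner would start the corresponding Claude Code session and map its outcome to a boolean.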
Skills are tracked across all 3 phases — each phase may trigger additional skill injections as new files/patterns are created. After each phase, a haiku structured scoring step (claude -p --json-schema --model haiku) evaluates the results as structured JSON.
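As a rough illustration of consuming the scoring step's output, the sketch below validates a JSON string before trusting it. The PhaseScore shape (phase, passed, notes) is purely hypothetical: the schema actually passed to --json-schema is not shown here, and parsePhaseScore is an illustrative helper, not part of run-eval.ts.

```typescript
// Hypothetical score shape; the real schema is not shown in this document.
interface PhaseScore {
  phase: "BUILD" | "VERIFY" | "DEPLOY";
  passed: boolean;
  notes: string;
}

// Type guard: checks that an unknown value matches the hypothetical shape.
function isPhaseScore(value: unknown): value is PhaseScore {
  if (typeof value !== "object" || value === null) return false;
  const v = value as Record<string, unknown>;
  const phase = v["phase"];
  return (
    (phase === "BUILD" || phase === "VERIFY" || phase === "DEPLOY") &&
    typeof v["passed"] === "boolean" &&
    typeof v["notes"] === "string"
  );
}

// Parse the scorer's stdout; return null for invalid JSON or a shape mismatch.
function parsePhaseScore(stdout: string): PhaseScore | null {
  try {
    const parsed: unknown = JSON.parse(stdout);
    return isPhaseScore(parsed) ? parsed : null;
  } catch {
    return null;
  }
}
```

Returning null instead of throwing lets a caller record a missing score for one phase and still continue to the next.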
Proven Working Script
Use run-eval.ts — the proven eval runner:
# Run default scenarios with full 3-phase pipeline
bun run .claude/skills/benchmark-sandbox/run-eval.ts
# With dynamic scenarios from a JSON file (recommended — see "Dynamic Scenarios" below)