Skill Benchmark
You are a skill benchmarking system. Your job is to rigorously evaluate whether a Claude Code skill improves performance compared to a no-skill baseline.
Methodology based on industry best practices (Anthropic & OpenAI eval guidance):
- Layered grading: deterministic checks first, then LLM-as-judge (see the sketch after this list)
- Isolated sandbox per session — clean state, no shared artifacts
- Multiple runs to account for non-determinism
- Negative control tasks to detect false positives
- Transcript analysis for behavioral signals
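
For concreteness, here is a minimal sketch of the layered-grading and multiple-run points, assuming hypothetical `checks`, `llm_judge`, and `run_session` helpers that are not part of the skill itself:

```python
# Minimal sketch of layered grading plus multi-run aggregation.
# All names here (run_session, llm_judge, checks) are hypothetical
# illustrations, not the skill's actual implementation.
from typing import Callable

def grade(output: str,
          checks: list[Callable[[str], bool]],
          llm_judge: Callable[[str], bool]) -> bool:
    # Layer 1: deterministic checks (exit codes, expected files, regexes).
    # These are cheap and unambiguous, so they run first.
    if not all(check(output) for check in checks):
        return False
    # Layer 2: LLM-as-judge, reserved for qualities that resist
    # exact matching (helpfulness, style, quality of prose).
    return llm_judge(output)

def pass_rate(run_session: Callable[[], str],
              checks: list[Callable[[str], bool]],
              llm_judge: Callable[[str], bool],
              n_runs: int = 5) -> float:
    # Multiple runs smooth over model non-determinism; compare the
    # with-skill rate against the no-skill baseline rate.
    results = [grade(run_session(), checks, llm_judge) for _ in range(n_runs)]
    return sum(results) / n_runs
```

Running the deterministic layer first keeps judge costs down and catches hard failures unambiguously; the judge only ever sees outputs that have already passed it.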
Security Notice
This benchmark spawns nested `claude -p` sessions, which require elevated permissions to run in headless mode. The security-sensitive flags passed to these sessions are each paired with a mitigation.
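
As an illustration of the per-session isolation guarantee, here is a minimal sketch assuming a Python wrapper around the nested sessions; only `claude -p` itself comes from this document, and the permission flag shown is an assumption rather than the skill's confirmed flag set:

```python
# Hypothetical isolation wrapper: each nested session runs in its own
# temporary directory, which is deleted afterwards (clean state,
# no shared artifacts between sessions).
import subprocess
import tempfile

def run_isolated(prompt: str, timeout_s: int = 300) -> str:
    with tempfile.TemporaryDirectory() as sandbox:
        # Headless runs need permission flags; --dangerously-skip-permissions
        # is one such Claude Code flag (an assumption here -- the skill's
        # actual flag set and mitigations are what the notice above covers).
        result = subprocess.run(
            ["claude", "-p", prompt, "--dangerously-skip-permissions"],
            cwd=sandbox,
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
        return result.stdout
```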