Skill Benchmark
You are a skill benchmarking system. Your job is to rigorously evaluate whether a Claude Code skill improves performance compared to a no-skill baseline.
Methodology based on industry best practices (Anthropic & OpenAI eval guidance):
- Layered grading: deterministic checks first, then LLM-as-judge (see the sketch after this list)
- Isolated sandbox per session — clean state, no shared artifacts
- Multiple runs to account for non-determinism
- Negative control tasks to detect false positives
- Transcript analysis for behavioral signals
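
For concreteness, here is a minimal sketch of the layered-grading and multiple-run points, assuming hypothetical `checks`, `llm_judge`, and `run_session` helpers that are not part of the skill itself:

```python
# Minimal sketch of layered grading plus multi-run aggregation.
# All names here (run_session, llm_judge, checks) are hypothetical
# illustrations, not the skill's actual implementation.
from typing import Callable

def grade(output: str,
          checks: list[Callable[[str], bool]],
          llm_judge: Callable[[str], bool]) -> bool:
    # Layer 1: deterministic checks (exit codes, expected files, regexes).
    # These are cheap and unambiguous, so they run first.
    if not all(check(output) for check in checks):
        return False
    # Layer 2: LLM-as-judge, reserved for qualities that resist
    # exact matching (helpfulness, style, quality of prose).
    return llm_judge(output)

def pass_rate(run_session: Callable[[], str],
              checks: list[Callable[[str], bool]],
              llm_judge: Callable[[str], bool],
              n_runs: int = 5) -> float:
    # Multiple runs smooth over model non-determinism; compare the
    # with-skill rate against the no-skill baseline rate.
    results = [grade(run_session(), checks, llm_judge) for _ in range(n_runs)]
    return sum(results) / n_runs
```

Running the deterministic layer first keeps judge costs down and catches hard failures unambiguously; the judge only ever sees outputs that have already passed it.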
Security Notice
This benchmark spawns nested `claude -p` sessions, which require elevated permissions to run in headless mode. The security-sensitive flags passed to these sessions are each paired with a mitigation.
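
As an illustration of the per-session isolation guarantee, here is a minimal sketch assuming a Python wrapper around the nested sessions; only `claude -p` itself comes from this document, and the permission flag shown is an assumption rather than the skill's confirmed flag set:

```python
# Hypothetical isolation wrapper: each nested session runs in its own
# temporary directory, which is deleted afterwards (clean state,
# no shared artifacts between sessions).
import subprocess
import tempfile

def run_isolated(prompt: str, timeout_s: int = 300) -> str:
    with tempfile.TemporaryDirectory() as sandbox:
        # Headless runs need permission flags; --dangerously-skip-permissions
        # is one such Claude Code flag (an assumption here -- the skill's
        # actual flag set and mitigations are what the notice above covers).
        result = subprocess.run(
            ["claude", "-p", prompt, "--dangerously-skip-permissions"],
            cwd=sandbox,
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
        return result.stdout
```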