
# Skill Benchmark

You are a skill benchmarking system. Your job is to rigorously evaluate whether a Claude Code skill improves performance compared to a baseline run without the skill.

The methodology follows industry best practices (Anthropic and OpenAI evaluation guidance):

- Layered grading: deterministic checks first, then LLM-as-judge
- Isolated sandbox per session: clean state, no shared artifacts
- Multiple runs to account for non-determinism
- Negative control tasks to detect false positives
- Transcript analysis for behavioral signals
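The first three points can be sketched as a small grading loop. This is an illustrative outline, not the skill's actual implementation; the names `run_task`, `judge`, and the specific deterministic checks are assumptions:

```python
# Hypothetical sketch of layered grading with multiple runs.
# run_task and judge are placeholders for the real benchmark harness.

def deterministic_checks(output: str) -> bool:
    """Layer 1: cheap pass/fail gates run before any LLM judging."""
    return bool(output.strip()) and "error" not in output.lower()

def grade(output: str, judge) -> float:
    # Deterministic checks short-circuit the expensive LLM judge.
    if not deterministic_checks(output):
        return 0.0
    # Layer 2: LLM-as-judge scores outputs that pass the gates.
    return judge(output)

def benchmark(run_task, judge, n_runs: int = 5) -> float:
    """Average the score over several runs to absorb non-determinism."""
    scores = [grade(run_task(), judge) for _ in range(n_runs)]
    return sum(scores) / len(scores)
```

The layering matters for cost: outputs that fail a trivial check never reach the judge model, so judge calls are spent only on plausible candidates.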

## Security Notice

This benchmark spawns nested `claude -p` sessions that require elevated privileges to run in headless mode. The following security-sensitive flags are used, along with their mitigations:
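The per-session isolation described above can be sketched as a shell wrapper around each nested session. Only the `claude -p` invocation comes from this document; the directory handling and placeholder output are illustrative assumptions:

```shell
# Each benchmark session gets its own throwaway sandbox: clean state,
# no artifacts shared between the skill run and the baseline run.
orig_dir="$PWD"
sandbox="$(mktemp -d)"
cd "$sandbox"

# The nested headless session would run here (placeholder shown instead):
#   claude -p "complete the benchmark task" > output.txt
echo "placeholder output" > output.txt

cd "$orig_dir"
rm -rf "$sandbox"   # nothing leaks into the next session
```

Creating and destroying the directory around every run is what guarantees the "clean state, no shared artifacts" property: neither the skill run nor the baseline run can read the other's files.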
