model-evaluation-benchmark
Model Evaluation Benchmark Skill
Purpose: Automates the reproduction of comprehensive model evaluation benchmarks, following the Benchmark Suite V3 reference implementation.
Auto-activates when: The user requests model benchmarking, a comparative evaluation, or performance testing of AI models in agentic workflows.
Skill Description
This skill orchestrates end-to-end model evaluation benchmarks that measure:
- Efficiency: Duration, turns, cost, tool calls
- Quality: Code quality scores via reviewer agents
- Workflow Adherence: Subagent calls, skills used, workflow step compliance
- Artifacts: GitHub issues, PRs, documentation generated
The skill automates the entire benchmark workflow, from execution through cleanup, following the Benchmark Suite V3 reference implementation.
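To make the four measurement categories concrete, here is a minimal sketch of how one benchmark run might be recorded. It is not taken from the Benchmark Suite V3 reference implementation: every class name, field name, and unit (seconds, USD, the quality scale) is an assumption chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Efficiency:
    """Efficiency metrics: duration, turns, cost, tool calls (units assumed)."""
    duration_seconds: float
    turns: int
    cost_usd: float
    tool_calls: int


@dataclass
class WorkflowAdherence:
    """Workflow metrics: subagent calls, skills used, step compliance."""
    subagent_calls: int
    skills_used: list[str] = field(default_factory=list)
    steps_completed: int = 0
    steps_expected: int = 0

    def compliance(self) -> float:
        """Fraction of expected workflow steps completed (0.0 if none expected)."""
        if self.steps_expected == 0:
            return 0.0
        return self.steps_completed / self.steps_expected


@dataclass
class BenchmarkResult:
    """One model's result for one benchmark run (hypothetical schema)."""
    model: str
    efficiency: Efficiency
    quality_score: float  # score assigned by reviewer agents; scale is assumed
    workflow: WorkflowAdherence
    # Counts of generated artifacts, e.g. {"issues": 1, "prs": 1, "docs": 2}
    artifacts: dict[str, int] = field(default_factory=dict)


if __name__ == "__main__":
    # Placeholder values for illustration only, not real benchmark results.
    result = BenchmarkResult(
        model="model-a",
        efficiency=Efficiency(duration_seconds=120.0, turns=14,
                              cost_usd=0.42, tool_calls=31),
        quality_score=8.5,
        workflow=WorkflowAdherence(subagent_calls=3,
                                   skills_used=["example-skill"],
                                   steps_completed=9, steps_expected=12),
        artifacts={"issues": 1, "prs": 1, "docs": 2},
    )
    print(result.model, result.workflow.compliance())
```

Keeping each category in its own record means two models can be compared field by field, and derived values such as step compliance are computed from the raw counts instead of being stored separately.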