model-evaluation-benchmark
Model Evaluation Benchmark Skill
Purpose: Automated reproduction of comprehensive model evaluation benchmarks following the Benchmark Suite V3 reference implementation.
Auto-activates when: The user requests model benchmarking, comparative evaluation, or performance testing of AI models in agentic workflows.
Skill Description
This skill orchestrates end-to-end model evaluation benchmarks that measure:
- Efficiency: Duration, turns, cost, tool calls
- Quality: Code quality scores via reviewer agents
- Workflow Adherence: Subagent calls, skills used, workflow step compliance
- Artifacts: GitHub issues, PRs, documentation generated
The skill automates the entire benchmark workflow, from execution through cleanup, following the Benchmark Suite V3 reference implementation.
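To make the four measurement categories concrete, here is a minimal sketch of how one run's metrics might be recorded and compared. It is not taken from the reference implementation: the `BenchmarkResult` class, its field names, the units, and the `rank_results` helper are all hypothetical, chosen only to mirror the categories listed above.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class BenchmarkResult:
    """Hypothetical per-model record; every field name here is an assumption."""

    model: str
    # Efficiency
    duration_seconds: float
    turns: int
    cost_usd: float
    tool_calls: int
    # Quality (e.g. a score assigned by a reviewer agent; scale is assumed)
    quality_score: float
    # Workflow adherence
    subagent_calls: int
    skills_used: List[str] = field(default_factory=list)
    workflow_steps_completed: int = 0
    workflow_steps_total: int = 0
    # Artifacts (e.g. references to issues, PRs, or docs produced by the run)
    artifacts: List[str] = field(default_factory=list)

    def workflow_compliance(self) -> float:
        """Fraction of workflow steps completed; 0.0 when no steps are defined."""
        if self.workflow_steps_total == 0:
            return 0.0
        return self.workflow_steps_completed / self.workflow_steps_total


def rank_results(results: List[BenchmarkResult]) -> List[BenchmarkResult]:
    """Order runs by quality (highest first), breaking ties by lower cost."""
    return sorted(results, key=lambda r: (-r.quality_score, r.cost_usd))
```

Ranking by quality first and cost second is only one possible choice; a real comparison would likely weigh all four categories according to the goals of the benchmark.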
When to Use