model-evaluation-benchmark

Model Evaluation Benchmark Skill

Purpose: Automated reproduction of comprehensive model evaluation benchmarks following the Benchmark Suite V3 reference implementation.

Auto-activates when: User requests model benchmarking, comparison evaluation, or performance testing between AI models in agentic workflows.

Skill Description

This skill orchestrates end-to-end model evaluation benchmarks that measure the following (a possible per-run metrics record is sketched after this list):

  • Efficiency: Duration, turns, cost, tool calls
  • Quality: Code quality scores via reviewer agents
  • Workflow Adherence: Subagent calls, skills used, workflow step compliance
  • Artifacts: GitHub issues, PRs, documentation generated
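
A minimal sketch of a per-run metrics record covering these four dimensions, assuming a Python harness. The class and field names (BenchmarkRun, workflow_steps_completed, etc.) are illustrative only and are not the skill's actual schema.

```python
# Hypothetical sketch of the metrics a single benchmark run might record;
# field names are illustrative, not taken from the skill itself.
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class BenchmarkRun:
    model: str
    # Efficiency
    duration_seconds: float = 0.0
    turns: int = 0
    cost_usd: float = 0.0
    tool_calls: int = 0
    # Quality: score assigned by a reviewer agent
    code_quality_score: Optional[float] = None
    # Workflow adherence
    subagent_calls: list[str] = field(default_factory=list)
    skills_used: list[str] = field(default_factory=list)
    workflow_steps_completed: int = 0
    workflow_steps_expected: int = 0
    # Artifacts produced during the run
    github_issues: list[str] = field(default_factory=list)
    pull_requests: list[str] = field(default_factory=list)
    docs_generated: list[str] = field(default_factory=list)

    @property
    def workflow_adherence(self) -> float:
        """Fraction of expected workflow steps that were completed."""
        if self.workflow_steps_expected == 0:
            return 0.0
        return self.workflow_steps_completed / self.workflow_steps_expected
```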

The skill automates the entire benchmark workflow from execution through cleanup, following the Benchmark Suite V3 reference implementation.
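
As a rough picture of that execution-through-cleanup flow, here is a minimal orchestration sketch reusing the hypothetical BenchmarkRun record above. The execute, review, and cleanup callables stand in for whatever the harness actually does; none of these names come from the skill.

```python
# Illustrative end-to-end flow: run each model, score the result with a
# reviewer step, then remove test artifacts created during the run.
from typing import Callable, Iterable


def run_benchmark(
    models: Iterable[str],
    execute: Callable[[str], BenchmarkRun],   # runs the agentic task for one model
    review: Callable[[BenchmarkRun], float],  # reviewer agent returns a quality score
    cleanup: Callable[[BenchmarkRun], None],  # deletes test issues/PRs after scoring
) -> list[BenchmarkRun]:
    results: list[BenchmarkRun] = []
    for model in models:
        run = execute(model)
        run.code_quality_score = review(run)
        cleanup(run)
        results.append(run)
    return results
```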

When to Use

Use this skill when a task calls for benchmarking a model, running a comparison evaluation between AI models, or performance testing in agentic workflows.
