model-evaluation-benchmark

Model Evaluation Benchmark Skill

Purpose: Automated reproduction of comprehensive model evaluation benchmarks following the Benchmark Suite V3 reference implementation.

Auto-activates when: The user requests model benchmarking, comparative evaluation, or performance testing of AI models in agentic workflows.

Skill Description

This skill orchestrates end-to-end model evaluation benchmarks that measure:

Efficiency: Duration, turns, cost, tool calls
Quality: Code quality scores via reviewer agents
Workflow Adherence: Subagent calls, skills used, workflow step compliance
Artifacts: GitHub issues, PRs, documentation generated
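The four measurement categories above could be captured in a single per-run record. The following is a minimal sketch, not the skill's actual schema: every class name, field name, and unit (`BenchmarkResult`, `cost_usd`, `steps_expected`, and so on) is a hypothetical chosen for illustration, and the compliance ratio is one assumed way to express workflow step compliance.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Efficiency:
    # Hypothetical fields mirroring "Duration, turns, cost, tool calls".
    duration_seconds: float
    turns: int
    cost_usd: float
    tool_calls: int


@dataclass
class WorkflowAdherence:
    # Hypothetical fields mirroring "Subagent calls, skills used,
    # workflow step compliance".
    subagent_calls: int
    skills_used: List[str] = field(default_factory=list)
    steps_completed: int = 0
    steps_expected: int = 0

    def compliance(self) -> float:
        """Fraction of expected workflow steps that were completed."""
        if self.steps_expected == 0:
            return 0.0
        return self.steps_completed / self.steps_expected


@dataclass
class BenchmarkResult:
    # One record per model per benchmark run (assumed granularity).
    model: str
    efficiency: Efficiency
    quality_score: float  # assumed to come from a reviewer agent
    workflow: WorkflowAdherence
    artifacts: List[str] = field(default_factory=list)  # e.g. issue or PR references


def summarize(result: BenchmarkResult) -> Dict[str, object]:
    """Flatten a result into a single row suitable for a comparison table."""
    return {
        "model": result.model,
        "duration_seconds": result.efficiency.duration_seconds,
        "turns": result.efficiency.turns,
        "cost_usd": result.efficiency.cost_usd,
        "tool_calls": result.efficiency.tool_calls,
        "quality_score": result.quality_score,
        "subagent_calls": result.workflow.subagent_calls,
        "skills_used": len(result.workflow.skills_used),
        "workflow_compliance": result.workflow.compliance(),
        "artifact_count": len(result.artifacts),
    }
```

Flattening each run into one row keeps side-by-side model comparison simple: rows from different models can be collected into a list and sorted or tabulated on any column.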

The skill automates the entire benchmark workflow, from execution through cleanup, following the Benchmark Suite V3 reference implementation.

When to Use

Repository

rysweet/amplihack

First Seen

Jan 23, 2026
