Agent Benchmark Framework

Without benchmarks, we cannot know whether agent changes improve or degrade quality. This skill defines how to measure, track, and protect agent performance.

When to Activate

Before and after modifying any agent definition file
When adding a new skill that an agent depends on
During periodic quality audits (weekly or monthly)
When a user reports degraded agent output
Before promoting an agent from experimental to production

Core Concepts

Why Benchmarks Matter

Agent quality degrades silently. A prompt tweak that improves one response can break ten others. Without a baseline to compare against, every change is a guess. Benchmarks make quality visible and regressions detectable.
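
To make the baseline idea concrete, here is a minimal sketch of a regression check, assuming each benchmark case yields a score between 0 and 1. The function name find_regressions, the Regression record, the case IDs, the scores, and the 0.05 tolerance are all hypothetical and chosen only for illustration; they are not part of this skill's defined interface.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Regression:
    """One benchmark case whose score dropped between two runs."""
    case_id: str
    baseline: float
    current: float


def find_regressions(baseline, current, tolerance=0.05):
    """Return cases whose score dropped by more than `tolerance`.

    `baseline` and `current` map case IDs to scores.
    A case missing from `current` is treated as a score of 0.0.
    """
    regressions = []
    for case_id, old_score in sorted(baseline.items()):
        new_score = current.get(case_id, 0.0)
        if old_score - new_score > tolerance:
            regressions.append(Regression(case_id, old_score, new_score))
    return regressions


# Hypothetical scores for the same cases before and after a prompt tweak.
baseline_scores = {"summarize-01": 0.90, "refactor-02": 0.80, "triage-03": 0.70}
current_scores = {"summarize-01": 0.95, "refactor-02": 0.50, "triage-03": 0.70}

for r in find_regressions(baseline_scores, current_scores):
    print(f"{r.case_id}: {r.baseline:.2f} -> {r.current:.2f}")
```

In this made-up data the tweak improves one case and breaks another. The improvement alone would look like a win, and only the comparison against the stored baseline surfaces the drop.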

Benchmark Types