agent-benchmark
Agent Benchmark Framework
Without benchmarks, we cannot know whether agent changes improve or degrade quality. This skill defines how to measure, track, and protect agent performance.
When to Activate
- Before and after modifying any agent definition file
- When adding a new skill that an agent depends on
- During periodic quality audits (weekly or monthly)
- When a user reports degraded agent output
- Before promoting an agent from experimental to production
Core Concepts
Why Benchmarks Matter
Agent quality degrades silently. A prompt tweak that improves one response can break ten others. Without a baseline to compare against, every change is a guess. Benchmarks make quality visible and regressions detectable.
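As a minimal sketch of what comparing against a baseline can look like, the snippet below flags benchmark cases whose score dropped after a change. The names `find_regressions` and `Regression`, the 0.0 to 1.0 score scale, the case IDs, and the 0.05 tolerance are all illustrative assumptions, not part of this skill's defined interface.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Regression:
    """One benchmark case whose score fell below its baseline (hypothetical type)."""
    case_id: str
    baseline: float
    current: float


def find_regressions(baseline, current, tolerance=0.05):
    """Return cases whose score dropped by more than `tolerance`.

    Both arguments map a case ID to a score. A case missing from
    `current` is treated as a score of 0.0, so dropped cases surface
    as regressions instead of disappearing silently.
    """
    regressions = []
    for case_id, base_score in sorted(baseline.items()):
        new_score = current.get(case_id, 0.0)
        if base_score - new_score > tolerance:
            regressions.append(Regression(case_id, base_score, new_score))
    return regressions


# Illustrative scores only: one case improves, one regresses, one is unchanged.
baseline_scores = {"summarize": 0.90, "refactor": 0.80, "triage": 0.70}
current_scores = {"summarize": 0.95, "refactor": 0.60, "triage": 0.70}

for r in find_regressions(baseline_scores, current_scores):
    print(f"{r.case_id}: {r.baseline:.2f} -> {r.current:.2f}")
# prints "refactor: 0.80 -> 0.60"
```

The improvement on `summarize` would look like a win in isolation; only the per-case comparison shows that `refactor` got worse in the same change.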