agent-benchmark

Agent Benchmark Framework

Without benchmarks, we cannot know whether agent changes improve or degrade quality. This skill defines how to measure, track, and protect agent performance.

When to Activate

  • Before and after modifying any agent definition file
  • When adding a new skill that an agent depends on
  • During periodic quality audits (weekly or monthly)
  • When a user reports degraded agent output
  • Before promoting an agent from experimental to production

Core Concepts

Why Benchmarks Matter

Agent quality degrades silently. A prompt tweak that improves one response can break ten others. Without a baseline to compare against, every change is a guess. Benchmarks make quality visible and regressions detectable.
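As a minimal sketch of what "a baseline to compare against" could look like in practice, the snippet below compares per-case scores from a stored baseline run with scores from a run after a change, and flags any case that dropped by more than a tolerance. The function name `find_regressions`, the case names, the 0-to-1 score scale, and the 0.05 tolerance are all illustrative assumptions, not part of this skill's definition.

```python
from dataclasses import dataclass


@dataclass
class Regression:
    """One benchmark case whose score dropped relative to the baseline."""
    case_id: str
    baseline: float
    current: float


def find_regressions(baseline, current, tolerance=0.05):
    """Return cases whose score dropped by more than `tolerance`.

    `baseline` and `current` map a case id to a score (assumed 0.0-1.0).
    A case missing from `current` is treated as a score of 0.0.
    """
    regressions = []
    for case_id, base_score in sorted(baseline.items()):
        new_score = current.get(case_id, 0.0)
        if base_score - new_score > tolerance:
            regressions.append(Regression(case_id, base_score, new_score))
    return regressions


# Hypothetical scores: one case improves, one regresses, one is unchanged.
baseline_scores = {"summarize": 0.70, "refactor": 0.80, "triage": 0.70}
current_scores = {"summarize": 0.95, "refactor": 0.60, "triage": 0.70}

for r in find_regressions(baseline_scores, current_scores):
    print(f"{r.case_id}: {r.baseline:.2f} -> {r.current:.2f}")
```

In this made-up data the average score rises slightly even though one case regresses, which is why the comparison is done per case rather than on an aggregate.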

Benchmark Types

First Seen
Apr 24, 2026
agent-benchmark — vibeeval/vibecosystem