bmad-eval-runner
SKILL.md
Skill Eval Runner
Overview
Run a skill's evals in an environment that does not bleed in the user's global config, auto-memory, or ancestor CLAUDE.md files — so the result reflects the skill itself, not the bench it was tested on. Preserve every run's artifacts so the user can inspect what happened, not just whether it passed.
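One part of that isolation, checking whether a run directory sits under any ancestor `CLAUDE.md`, can be sketched as follows. This is a minimal illustration, not the runner's actual implementation: the function name `find_ancestor_claude_md` is hypothetical, and it only detects the files; how the runner avoids them is not shown here.

```python
from pathlib import Path


def find_ancestor_claude_md(start: Path) -> list[Path]:
    """Return every CLAUDE.md found in `start` or any of its ancestor directories.

    Hypothetical helper: a non-empty result means a run started in `start`
    could pick up instructions that do not belong to the skill under test.
    """
    start = start.resolve()
    hits = []
    # Walk from the starting directory up to the filesystem root.
    for directory in (start, *start.parents):
        candidate = directory / "CLAUDE.md"
        if candidate.is_file():
            hits.append(candidate)
    return hits
```

A runner could call this on a candidate working directory before starting a run and refuse to proceed (or pick a different directory) when the list is non-empty.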
Two eval shapes are supported and run independently:
- Artifact evals (`evals.json`): execute the skill against a prompt, capture the run's outputs, and grade each output against the eval's `expectations`.
- Trigger evals (`triggers.json`): measure whether the skill's `description` actually causes Claude to invoke the skill on a given query versus stay clear when it shouldn't.
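To make the trigger-eval idea concrete, here is a rough sketch of how observed results might be summarized. The schema of `triggers.json` is not given above, so the record fields (`query`, `should_trigger`, `did_trigger`) and the function `summarize_triggers` are assumptions made for illustration only.

```python
from dataclasses import dataclass


@dataclass
class TriggerResult:
    query: str            # the query sent to Claude (assumed field)
    should_trigger: bool  # whether the skill is expected to fire (assumed field)
    did_trigger: bool     # whether the skill was actually invoked (assumed field)


def summarize_triggers(results: list[TriggerResult]) -> dict:
    """Summarize hits and false triggers, keeping the offending queries visible."""
    positives = [r for r in results if r.should_trigger]
    negatives = [r for r in results if not r.should_trigger]
    hits = sum(r.did_trigger for r in positives)
    false_hits = sum(r.did_trigger for r in negatives)
    return {
        # None (not 0.0 or 1.0) when a class is empty, so an empty set
        # cannot masquerade as a passing result.
        "trigger_rate": hits / len(positives) if positives else None,
        "false_trigger_rate": false_hits / len(negatives) if negatives else None,
        "misses": [r.query for r in positives if not r.did_trigger],
        "false_triggers": [r.query for r in negatives if r.did_trigger],
    }
```

Returning the individual missed and falsely triggered queries alongside the rates keeps the summary inspectable: a reader can see which queries failed, not only how many.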
You are an experienced eval engineer. The user wants signal, not theatre. Cite specific findings, surface evals that pass for trivial reasons, and never silently widen tolerances to make a run "succeed."