evals — evaluation harness
A sibling skill to senate. Runs fixtures through the full senate lifecycle, then grades the result with two complementary signals:
- Deterministic graders: schema/contract conformance against `skills/senate/references/workspace.md`. Cheap, run first.
- LLM judges: quality of `notes.md` (scored under both the `verdict` rubric and the `meeting_notes` rubric), `agenda.md`, and transcript process. Invoked via `claude -p --output-format json` (no API key needed; uses your Claude Code OAuth session).
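As a rough illustration of how the two signals might be combined, the sketch below runs a cheap deterministic check first and only calls the LLM judge when it passes. Everything here is hypothetical rather than the harness's actual code: the function names, the `REQUIRED_FILES` placeholder (the real contract lives in `skills/senate/references/workspace.md`), the prompt wording, the decision to skip judging on failure, and the assumption that the JSON envelope printed by `claude -p --output-format json` carries the reply text under a `result` key.

```python
import json
import subprocess
from typing import Callable, Mapping

# Placeholder contract (assumption): the real rules live in
# skills/senate/references/workspace.md and are not reproduced here.
REQUIRED_FILES = ("notes.md", "agenda.md")


def deterministic_grade(workspace: Mapping[str, str]) -> list[str]:
    """Return a list of contract violations; an empty list means pass."""
    return [
        f"missing or empty: {name}"
        for name in REQUIRED_FILES
        if not workspace.get(name, "").strip()
    ]


def run_claude(prompt: str) -> str:
    """Call the Claude Code CLI in print mode and return its raw stdout."""
    proc = subprocess.run(
        ["claude", "-p", prompt, "--output-format", "json"],
        capture_output=True,
        text=True,
        check=True,
    )
    return proc.stdout


def llm_judge(
    artifact: str, rubric: str, runner: Callable[[str], str] = run_claude
) -> str:
    """Ask the judge to grade one artifact under one named rubric."""
    prompt = f"Grade the following under the '{rubric}' rubric.\n\n{artifact}"
    envelope = json.loads(runner(prompt))
    # Assumption: the reply text sits under the "result" key of the envelope.
    return envelope["result"]


def evaluate(
    workspace: Mapping[str, str], runner: Callable[[str], str] = run_claude
) -> dict:
    """Run deterministic checks first; call the judge only if they pass."""
    violations = deterministic_grade(workspace)
    if violations:
        # Design choice of this sketch: skip judge calls on a broken workspace.
        return {"passed": False, "violations": violations, "judgments": {}}
    notes = workspace["notes.md"]
    judgments = {
        "verdict": llm_judge(notes, "verdict", runner),
        "meeting_notes": llm_judge(notes, "meeting_notes", runner),
    }
    return {"passed": True, "violations": [], "judgments": judgments}
```

Passing a stub `runner` (any callable that returns a JSON string) makes it possible to exercise the control flow without a Claude Code session.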
Methodology follows Demystifying evals for AI agents — capability vs. regression sets, deterministic + model-based + (eventual) human review, eval-driven iteration.
When to trigger
- User asks to evaluate, benchmark, or test the senate skill.
- User adds a new CLI playbook (`skills/invoke-agent/references/<name>.md`) and wants to validate it.
- User edits a format file and wants to confirm it still produces parseable output.
- CI job running nightly.