evals — evaluation harness

A sibling skill to senate. Runs fixtures through the full senate lifecycle, then grades the result with two complementary signals:

Deterministic graders — schema/contract conformance against skills/senate/references/workspace.md. Cheap, run first.
LLM judges — quality of notes.md (scored under both the verdict rubric and the meeting_notes rubric), agenda.md, and transcript process. Invoked via claude -p --output-format json (no API key needed; uses your Claude Code OAuth session).

Methodology follows Demystifying evals for AI agents — capability vs. regression sets, deterministic + model-based + (eventual) human review, eval-driven iteration.

When to trigger

User asks to evaluate, benchmark, or test the senate skill.
User adds a new CLI playbook (skills/invoke-agent/references/<name>.md) and wants to validate it.
User edits a format file and wants to confirm it still produces parseable output.
CI job running nightly.

evals

evals — evaluation harness

When to trigger

Layout