evals — evaluation harness

A sibling skill to senate. Runs fixtures through the full senate lifecycle, then grades the result with two complementary signals:

  • Deterministic graders — schema/contract conformance against skills/senate/references/workspace.md. Cheap, run first.
  • LLM judges — quality of notes.md (scored under both the verdict rubric and the meeting_notes rubric), agenda.md, and transcript process. Invoked via claude -p --output-format json (no API key needed; uses your Claude Code OAuth session).
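
The two-stage flow above can be sketched roughly as follows. This is a minimal illustration, not the skill's actual code: the function names, the shape of `checks`, and the decision to skip the judge when a deterministic check fails are all assumptions. Only the `claude -p --output-format json` invocation comes from the description above.

```python
import json
import subprocess
from typing import Callable, Sequence


def build_judge_command(prompt: str) -> list[str]:
    """Command line for one judge call, using the flags named above."""
    return ["claude", "-p", prompt, "--output-format", "json"]


def run_cli(cmd: Sequence[str]) -> str:
    """Run the command and return its stdout; raises on a non-zero exit."""
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout


def judge(prompt: str, runner: Callable[[Sequence[str]], str] = run_cli) -> dict:
    """Invoke the judge and parse its JSON reply into a dict."""
    return json.loads(runner(build_judge_command(prompt)))


def grade(workspace: dict, checks, prompt: str, runner=run_cli) -> dict:
    """Run cheap deterministic checks first; call the judge only if all pass.

    `checks` is a hypothetical list of (name, predicate) pairs, and skipping
    the judge on failure is an assumption made for this sketch.
    """
    failures = [name for name, check in checks if not check(workspace)]
    if failures:
        return {"deterministic": "fail", "failures": failures, "judge": None}
    return {"deterministic": "pass", "failures": [], "judge": judge(prompt, runner)}
```

Passing a stub as `runner` makes it possible to exercise `grade` without a Claude Code session, which keeps the deterministic path testable on its own.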

Methodology follows "Demystifying evals for AI agents": capability vs. regression sets, deterministic + model-based + (eventual) human review, and eval-driven iteration.

When to trigger

  • User asks to evaluate, benchmark, or test the senate skill.
  • User adds a new CLI playbook (skills/invoke-agent/references/<name>.md) and wants to validate it.
  • User edits a format file and wants to confirm it still produces parseable output.
  • A CI job runs the evals nightly.

Layout

evals — sebastianelvis/senate