eval-harness

Summary

Formal evaluation framework for Claude Code sessions implementing eval-driven development principles.

  • Defines capability and regression evals with pass/fail criteria before implementation, treating evals as unit tests for AI-assisted workflows
  • Supports three grader types: code-based (deterministic checks via bash/grep), model-based (Claude-as-judge), and human review for manual adjudication
  • Tracks reliability with pass@k metrics (success within k attempts) and pass^k (all k trials succeed), with recommended thresholds of pass@3 ≥ 90% for capabilities and pass^3 = 100% for regressions
  • Integrates into Claude Code workflow with commands to define evals before coding, check status during implementation, and generate reports post-completion
  • Stores eval definitions, run history, and baselines in .claude/evals/ directory for version control alongside code
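An eval definition stored under .claude/evals/ might look like the following minimal sketch. All field names here (name, type, grader, check, threshold) are illustrative assumptions, not the skill's actual schema:

```python
# Hypothetical eval definition as it might be stored under .claude/evals/.
# Field names are assumptions for illustration, not the harness's real schema.
import json

eval_def = {
    "name": "add-retry-logic",
    "type": "capability",       # or "regression"
    "grader": "code",           # "code" | "model" | "human"
    # a code-based grader runs a deterministic shell check:
    "check": "grep -q 'max_retries' src/client.py",
    "threshold": {"metric": "pass@3", "min": 0.90},
}

print(json.dumps(eval_def, indent=2))
```

Keeping definitions like this as plain files in the repository lets them be versioned and reviewed alongside the code they gate.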
SKILL.md

Eval Harness Skill

A formal evaluation framework for Claude Code sessions, implementing eval-driven development (EDD) principles.

When to Activate

  • Setting up eval-driven development (EDD) for AI-assisted workflows
  • Defining pass/fail criteria for Claude Code task completion
  • Measuring agent reliability with pass@k metrics
  • Creating regression test suites for prompt or agent changes
  • Benchmarking agent performance across model versions

Philosophy

Eval-Driven Development treats evals as the "unit tests of AI development":

  • Define expected behavior BEFORE implementation
  • Run evals continuously during development
  • Track regressions with each change
  • Use pass@k metrics for reliability measurement
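The two reliability metrics above can be sketched over recorded trial outcomes (a list of booleans per eval run). This aggregation is an assumption about how the harness might compute them, not its actual implementation:

```python
# Sketch of pass@k vs pass^k over recorded trial outcomes.
# pass@k: the task succeeds at least once within k attempts (lenient).
# pass^k: the task succeeds on all k attempts (strict, for regressions).

def pass_at_k(trials: list[bool], k: int) -> bool:
    """True if at least one of the first k attempts succeeded."""
    return any(trials[:k])

def pass_caret_k(trials: list[bool], k: int) -> bool:
    """True only if all of the first k attempts succeeded."""
    return len(trials) >= k and all(trials[:k])

# Example: three recorded attempts for one eval
attempts = [False, True, True]
print(pass_at_k(attempts, 3))     # → True  (one success is enough)
print(pass_caret_k(attempts, 3))  # → False (the first attempt failed)
```

At the suite level, "pass@3 ≥ 90%" would then mean at least 90% of capability evals return True under pass_at_k with k=3, while "pass^3 = 100%" requires every regression eval to return True under pass_caret_k.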