eval-harness

Summary

Formal evaluation framework for Claude Code sessions implementing eval-driven development principles.

  • Defines capability and regression evals with pass/fail criteria before implementation, treating evals as unit tests for AI-assisted workflows
  • Supports three grader types: code-based (deterministic checks via bash/grep), model-based (Claude-as-judge), and human review for manual adjudication
  • Tracks reliability with pass@k metrics (success within k attempts) and pass^k (all k trials succeed), with recommended thresholds of pass@3 ≥ 90% for capabilities and pass^3 = 100% for regressions
  • Integrates into Claude Code workflow with commands to define evals before coding, check status during implementation, and generate reports post-completion
  • Stores eval definitions, run history, and baselines in .claude/evals/ directory for version control alongside code
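An eval definition stored under .claude/evals/ might look like the following minimal sketch. All field names here (name, type, grader, check, threshold) are illustrative assumptions, not the skill's actual schema:

```python
# Hypothetical eval definition as it might be stored under .claude/evals/.
# Field names are assumptions for illustration, not the harness's real schema.
import json

eval_def = {
    "name": "add-retry-logic",
    "type": "capability",       # or "regression"
    "grader": "code",           # "code" | "model" | "human"
    # a code-based grader runs a deterministic shell check:
    "check": "grep -q 'max_retries' src/client.py",
    "threshold": {"metric": "pass@3", "min": 0.90},
}

print(json.dumps(eval_def, indent=2))
```

Keeping definitions like this as plain files in the repository lets them be versioned and reviewed alongside the code they gate.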
SKILL.md

Eval Harness Skill

A formal evaluation framework for Claude Code sessions, implementing eval-driven development (EDD) principles.

When to Activate

  • Setting up eval-driven development (EDD) for AI-assisted workflows
  • Defining pass/fail criteria for Claude Code task completion
  • Measuring agent reliability with pass@k metrics
  • Creating regression test suites for prompt or agent changes
  • Benchmarking agent performance across model versions

Philosophy

Eval-Driven Development treats evals as the "unit tests of AI development":

  • Define expected behavior BEFORE implementation
  • Run evals continuously during development
  • Track regressions with each change
  • Use pass@k metrics for reliability measurement
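The two reliability metrics above can be sketched over recorded trial outcomes (a list of booleans per eval run). This aggregation is an assumption about how the harness might compute them, not its actual implementation:

```python
# Sketch of pass@k vs pass^k over recorded trial outcomes.
# pass@k: the task succeeds at least once within k attempts (lenient).
# pass^k: the task succeeds on all k attempts (strict, for regressions).

def pass_at_k(trials: list[bool], k: int) -> bool:
    """True if at least one of the first k attempts succeeded."""
    return any(trials[:k])

def pass_caret_k(trials: list[bool], k: int) -> bool:
    """True only if all of the first k attempts succeeded."""
    return len(trials) >= k and all(trials[:k])

# Example: three recorded attempts for one eval
attempts = [False, True, True]
print(pass_at_k(attempts, 3))     # → True  (one success is enough)
print(pass_caret_k(attempts, 3))  # → False (the first attempt failed)
```

At the suite level, "pass@3 ≥ 90%" would then mean at least 90% of capability evals return True under pass_at_k with k=3, while "pass^3 = 100%" requires every regression eval to return True under pass_caret_k.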