eval-harness
Formal evaluation framework for Claude Code sessions implementing eval-driven development principles.
- Defines capability and regression evals with pass/fail criteria before implementation, treating evals as unit tests for AI-assisted workflows
- Supports three grader types: code-based (deterministic checks via bash/grep), model-based (Claude-as-judge), and human review for manual adjudication
- Tracks reliability with pass@k metrics (success within k attempts) and pass^k (all k trials succeed), with recommended thresholds of pass@3 ≥ 90% for capabilities and pass^3 = 100% for regressions
- Integrates into Claude Code workflow with commands to define evals before coding, check status during implementation, and generate reports post-completion
- Stores eval definitions, run history, and baselines in the .claude/evals/ directory for version control alongside code
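The two reliability metrics above can be made concrete with a small sketch. This is an illustration only, not the skill's own implementation: the function names (`pass_at_k`, `pass_pow_k`, `suite_rate`), the sample eval names, and the choice to aggregate by taking the fraction of evals that satisfy the metric are all assumptions made for the example.

```python
from typing import Callable, Dict, List

# Thresholds quoted in the summary above.
CAPABILITY_PASS_AT_3 = 0.90
REGRESSION_PASS_POW_3 = 1.00


def _require_k_trials(trials: List[bool], k: int) -> None:
    """Reject inputs that cannot support a k-trial metric."""
    if k < 1 or len(trials) < k:
        raise ValueError(f"need at least {k} trials, got {len(trials)}")


def pass_at_k(trials: List[bool], k: int) -> bool:
    """pass@k: at least one of the first k trials succeeded."""
    _require_k_trials(trials, k)
    return any(trials[:k])


def pass_pow_k(trials: List[bool], k: int) -> bool:
    """pass^k: every one of the first k trials succeeded."""
    _require_k_trials(trials, k)
    return all(trials[:k])


def suite_rate(
    results: Dict[str, List[bool]],
    k: int,
    metric: Callable[[List[bool], int], bool],
) -> float:
    """Fraction of evals in a suite that satisfy the metric (assumed aggregation)."""
    if not results:
        raise ValueError("suite has no evals")
    passed = sum(1 for trials in results.values() if metric(trials, k))
    return passed / len(results)


# Hypothetical run history: eval name -> outcome of each trial.
capability_runs = {
    "parse-config": [False, True, True],
    "retry-logic": [True, True, True],
}
regression_runs = {
    "login-flow": [True, True, True],
    "export-csv": [True, False, True],
}

capability_ok = suite_rate(capability_runs, 3, pass_at_k) >= CAPABILITY_PASS_AT_3
regression_ok = suite_rate(regression_runs, 3, pass_pow_k) >= REGRESSION_PASS_POW_3
```

With this sample data the capability suite meets its threshold, because each eval succeeds at least once in three trials, while the regression suite does not, because `export-csv` fails one of its three trials. The stricter pass^k check is what makes a single flaky regression visible.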
Eval Harness Skill
A formal evaluation framework for Claude Code sessions, implementing eval-driven development (EDD) principles.
When to Activate
- Setting up eval-driven development (EDD) for AI-assisted workflows
- Defining pass/fail criteria for Claude Code task completion
- Measuring agent reliability with pass@k metrics
- Creating regression test suites for prompt or agent changes
- Benchmarking agent performance across model versions
Philosophy
Eval-Driven Development treats evals as the "unit tests of AI development":
- Define expected behavior BEFORE implementation
- Run evals continuously during development
- Track regressions with each change
- Use pass@k metrics for reliability measurement
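One way to read "define expected behavior BEFORE implementation" is to write the pass/fail check down as data before any code exists, then run it repeatedly as work proceeds. The sketch below assumes a code-based grader where exit code 0 means pass. The `CodeEval` type, the `run_eval` helper, and the example file path and pattern are hypothetical and are not part of the skill itself.

```python
import subprocess
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class CodeEval:
    """A deterministic eval: pass means the command exits with status 0."""
    name: str
    kind: str            # "capability" or "regression"
    command: List[str]   # argv list, run without a shell


def run_eval(ev: CodeEval, timeout_s: float = 60.0) -> bool:
    """Run the eval's command and report pass/fail.

    A missing executable or a timeout is treated as a failure
    rather than an error, so one broken check cannot abort a run.
    """
    try:
        completed = subprocess.run(
            ev.command,
            capture_output=True,
            timeout=timeout_s,
            check=False,
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return completed.returncode == 0


# Written before implementation: it fails until the handler exists.
# The path and pattern are placeholders for illustration.
handler_exists = CodeEval(
    name="handler-defined",
    kind="capability",
    command=["grep", "-q", "def handle_request", "src/server.py"],
)
```

Calling `run_eval(handler_exists)` several times yields the per-trial booleans that a pass@k calculation consumes. Keeping the command as an argv list rather than a shell string avoids quoting problems in grep patterns.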