ai-evals


AI Evals

Scope

Covers

Designing evaluations (“evals”) for LLM/AI features as an execution contract: what “good” means and how it is measured
Converting failures into a golden test set + error taxonomy + rubric
Choosing a judging approach (human, LLM-as-judge, automated checks) and a repeatable harness/runbook
Producing decision-ready results and an iteration loop (every bug becomes a new test)
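The items above can be sketched as a small harness. This is only an illustrative outline, not the skill's own implementation: the names `GoldenCase`, `run_eval`, and `stub_model`, the substring check, and the 0.9 threshold are all assumptions made for the example. It shows a golden set whose cases carry an error-taxonomy tag, an automated check, and a decision-ready report against an acceptance threshold.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class GoldenCase:
    """One golden-set entry (hypothetical schema for this sketch)."""
    case_id: str
    prompt: str
    must_contain: str  # automated check: substring the output must include
    error_tag: str     # error-taxonomy label recorded when the case fails


def run_eval(model: Callable[[str], str],
             cases: list[GoldenCase],
             threshold: float) -> dict:
    """Run every golden case and summarize the result against a threshold."""
    if not cases:
        raise ValueError("golden set is empty")
    failures: Counter = Counter()
    passed = 0
    for case in cases:
        output = model(case.prompt)
        if case.must_contain.lower() in output.lower():
            passed += 1
        else:
            failures[case.error_tag] += 1
    pass_rate = passed / len(cases)
    return {
        "pass_rate": pass_rate,
        "ship": pass_rate >= threshold,
        "failures_by_tag": dict(failures),
    }


def stub_model(prompt: str) -> str:
    """Stand-in for a real LLM call so the sketch runs offline."""
    canned = {
        "What is 2 + 2?": "The answer is 4.",
        "Capital of France?": "I think it is Lyon.",
    }
    return canned.get(prompt, "")


GOLDEN_SET = [
    GoldenCase("math-001", "What is 2 + 2?", "4", "arithmetic_error"),
    GoldenCase("geo-001", "Capital of France?", "Paris", "factual_error"),
]

# Assumed acceptance threshold of 0.9 for illustration only.
report = run_eval(stub_model, GOLDEN_SET, threshold=0.9)
print(report)
```

Under this sketch, the iteration loop amounts to appending a new `GoldenCase` for each bug found in production and re-running `run_eval`. A human or LLM-as-judge rubric could replace the substring check by swapping in a different scoring function.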

When to use

“Design evals for this LLM feature so we can ship with confidence.”
“Create a rubric + golden set + benchmark for our AI assistant/copilot.”
“We’re seeing flaky quality—do error analysis and turn it into a repeatable eval.”
“Compare prompts/models safely with a clear acceptance threshold.”

Installs: 32
Repository: oldwinter/skills
GitHub Stars: 3
First Seen: Jan 27, 2026
Security Audits: Gen Agent Trust Hub (Pass)
