AI Evals

Scope

Covers

  • Designing evaluations (“evals”) for LLM/AI features as an execution contract: what “good” means and how it’s measured
  • Converting failures into a golden test set + error taxonomy + rubric (a minimal sketch follows this list)
  • Choosing a judging approach (human, LLM-as-judge, automated checks) and a repeatable harness/runbook (see the harness sketch after the “When to use” list)
  • Producing decision-ready results and an iteration loop (every bug becomes a new test)

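In practice the golden set, error taxonomy, and rubric are just structured records. Below is a minimal sketch in Python; the `GoldenCase` structure and every field name are illustrative assumptions, not a schema this skill prescribes.

```python
# A minimal sketch of a golden-set record: one real (anonymized) failure
# promoted to a permanent test case. All names here are illustrative
# assumptions, not a prescribed schema.
from dataclasses import dataclass, field


@dataclass
class GoldenCase:
    case_id: str              # stable ID so results stay comparable across runs
    input_text: str           # the input that originally triggered the failure
    expected_behavior: str    # what a "good" answer must do, in plain language
    error_tags: list[str] = field(default_factory=list)   # labels from the error taxonomy
    rubric: dict[str, str] = field(default_factory=dict)  # criterion -> pass condition


# Every bug found becomes a new case, so the set only grows.
case = GoldenCase(
    case_id="refund-policy-007",
    input_text="Can I return a sale item after 45 days?",
    expected_behavior="Cites the 30-day policy and does not invent exceptions.",
    error_tags=["hallucinated_policy"],
    rubric={
        "grounding": "States only policies present in the source docs.",
        "directness": "Gives a clear yes/no before any caveats.",
    },
)
```

Because each case carries its taxonomy tags, pass/fail results can later be aggregated per failure mode, which is what makes the output decision-ready.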
When to use

  • “Design evals for this LLM feature so we can ship with confidence.”
  • “Create a rubric + golden set + benchmark for our AI assistant/copilot.”
  • “We’re seeing flaky quality—do error analysis and turn it into a repeatable eval.”
  • “Compare prompts/models safely with a clear acceptance threshold.” (see the harness sketch below)

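One way these pieces connect: run each candidate prompt over the golden set, score every output with the chosen judge, and gate the decision on a pass-rate threshold. A minimal sketch, reusing the hypothetical `GoldenCase` record above; `generate`, `judge_passes`, and the 0.90 threshold are placeholders for your own model call, judging method, and acceptance bar.

```python
# Placeholder comparison harness. `generate` and `judge_passes` stand in
# for the model under test and the judging approach (human review,
# LLM-as-judge, or automated checks); 0.90 is an example acceptance bar,
# not a recommendation.

ACCEPTANCE_THRESHOLD = 0.90  # minimum pass rate required to ship


def generate(prompt_template: str, case: GoldenCase) -> str:
    raise NotImplementedError  # call the model under test here


def judge_passes(output: str, case: GoldenCase) -> bool:
    raise NotImplementedError  # apply the case's rubric here


def pass_rate(prompt_template: str, golden_set: list[GoldenCase]) -> float:
    passed = sum(
        judge_passes(generate(prompt_template, case), case)
        for case in golden_set
    )
    return passed / len(golden_set)


def compare(variant_a: str, variant_b: str, golden_set: list[GoldenCase]) -> None:
    for name, template in (("A", variant_a), ("B", variant_b)):
        rate = pass_rate(template, golden_set)
        verdict = "ship" if rate >= ACCEPTANCE_THRESHOLD else "iterate"
        print(f"variant {name}: {rate:.0%} pass rate -> {verdict}")
```

Making the threshold explicit turns “compare prompts/models” into a yes/no shipping decision rather than a judgment call.
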
When NOT to use

  • You need to decide what to build (use problem-definition, building-with-llms, or ai-product-strategy).
  • You’re primarily doing traditional non-LLM software testing (use your standard eng QA/unit/integration tests).