ai-evals
AI Evals
Scope
Covers
- Designing evaluations (“evals”) for LLM/AI features as an execution contract: what “good” means and how it’s measured
- Converting failures into a golden test set + error taxonomy + rubric
- Choosing a judging approach (human, LLM-as-judge, automated checks) and a repeatable harness/runbook (see the sketch after this list)
- Producing decision-ready results and an iteration loop (every bug becomes a new test)
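Below is a minimal sketch of what such a harness can look like, assuming a hypothetical `run_model` call and a simple `judge`; the `Case` fields, the function names, and the 0.9 pass threshold are illustrative stand-ins, not part of this skill.

```python
from dataclasses import dataclass

@dataclass
class Case:
    input: str      # prompt or scenario from the golden set
    expected: str   # reference answer or rubric anchor

def run_model(prompt: str) -> str:
    # Placeholder: call your LLM/feature here.
    raise NotImplementedError

def judge(output: str, case: Case) -> bool:
    # Placeholder grader: exact match here; swap in an automated
    # check, a rubric-based human score, or an LLM-as-judge call.
    return output.strip() == case.expected.strip()

def run_eval(golden_set: list[Case], threshold: float = 0.9) -> bool:
    # Run every golden case, report the pass rate, and gate on it.
    if not golden_set:
        raise ValueError("golden set is empty")
    passed = sum(judge(run_model(c.input), c) for c in golden_set)
    score = passed / len(golden_set)
    print(f"pass rate: {score:.0%} ({passed}/{len(golden_set)})")
    return score >= threshold  # True = clears the acceptance bar
```

In this loop, every new failure found in the wild becomes another `Case` appended to the golden set, which is what keeps the eval repeatable as the feature evolves.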
When to use
- “Design evals for this LLM feature so we can ship with confidence.”
- “Create a rubric + golden set + benchmark for our AI assistant/copilot.”
- “We’re seeing flaky quality—do error analysis and turn it into a repeatable eval.”
- “Compare prompts/models safely with a clear acceptance threshold.”
When NOT to use
- You need to decide what to build (use problem-definition, building-with-llms, or ai-product-strategy).
- You’re primarily doing traditional non-LLM software testing (use your standard eng QA/unit/integration tests).
Related skills
More from liqiongyu/lenny_skills_plus
problem-definition
Define a product problem: problem statement, JTBD, alternatives, evidence, metrics. See also: writing-prds (solution spec).
giving-presentations
Plan and deliver presentations: brief, narrative, slide outline, Q&A bank, rehearsal plan. See also: written-communication (async writing).
competitive-analysis
Produce a Competitive Analysis Pack (alternatives map, landscape, battlecards, monitoring plan).
pricing-strategy
Create a Pricing Strategy Pack (value metric, packaging, price points, conversion mechanics, rollout).
startup-ideation
Generate and evaluate startup ideas: theses table, scorecard, top idea brief, validation plan. See also: startup-pivoting (existing product).
writing-prds
Write a decision-ready PRD for cross-functional alignment. See also: writing-specs-designs (build-ready spec).