Advanced Evaluation
LLM-as-a-Judge techniques for evaluating AI outputs. LLM-as-a-Judge is not a single technique but a family of approaches; the core competency is choosing the right approach and mitigating its biases.
When to Activate
- Building automated evaluation pipelines for LLM outputs
- Comparing multiple model responses to select the best one
- Establishing consistent quality standards
- Debugging inconsistent evaluation results
- Designing A/B tests for prompt or model changes
- Creating rubrics for human or automated evaluation
Core Concepts
Evaluation Taxonomy
Direct Scoring: A single LLM rates one response on a defined scale.
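A minimal sketch of direct scoring, assuming the judge is any callable that takes a prompt string and returns the model's text reply. The names `build_scoring_prompt`, `parse_score`, and `direct_score`, the 1 to 5 scale, and the `Score: <integer>` reply format are illustrative assumptions, not part of any particular library.

```python
import re
from typing import Callable


def build_scoring_prompt(response: str, criterion: str, low: int = 1, high: int = 5) -> str:
    """Format a prompt asking the judge to rate one response on a fixed scale."""
    return (
        f"Rate the response below for {criterion} on a scale of {low} to {high}.\n"
        "Reply with a line of the form 'Score: <integer>'.\n\n"
        f"Response:\n{response}"
    )


def parse_score(judge_output: str, low: int = 1, high: int = 5) -> int:
    """Extract the integer score and reject anything outside the defined scale."""
    match = re.search(r"Score:\s*(-?\d+)", judge_output)
    if match is None:
        raise ValueError("no score found in judge output")
    score = int(match.group(1))
    if not low <= score <= high:
        raise ValueError(f"score {score} is outside the scale {low}-{high}")
    return score


def direct_score(
    response: str,
    criterion: str,
    judge: Callable[[str], str],  # hypothetical: wraps whatever LLM client is in use
    low: int = 1,
    high: int = 5,
) -> int:
    """Score a single response with a single judge call."""
    prompt = build_scoring_prompt(response, criterion, low, high)
    return parse_score(judge(prompt), low, high)
```

For a quick check without a model, a stub such as `lambda prompt: "Score: 4"` can stand in for the judge. Validating the parsed score against the scale keeps malformed judge output from silently entering downstream aggregates.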