Advanced Evaluation
Production-grade techniques for evaluating LLM outputs using LLMs as judges.
Evaluation Taxonomy
Direct Scoring
A single LLM rates one response on a defined scale.
- Best for: Objective criteria (factual accuracy, instruction following)
- Reliability: Moderate to high for well-defined criteria
- Failure mode: Score calibration drift (the same quality level receives different scores across prompts, runs, or judge versions)
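As an illustration of direct scoring, here is a minimal sketch. It assumes the judge is any callable that takes a prompt string and returns the model's text reply; the names `RUBRIC_PROMPT`, `parse_score`, and `direct_score`, the 1 to 5 scale, and the `Score: <integer>` reply format are all hypothetical choices made for this example. Taking the median of several samples is one possible way to dampen run-to-run variation, not a fix for calibration drift itself.

```python
import re
from statistics import median
from typing import Callable, Optional

# Hypothetical rubric prompt; the wording and reply format are assumptions.
RUBRIC_PROMPT = (
    "Rate the response for {criterion} on an integer scale from {lo} to {hi}.\n"
    "Response:\n{response}\n"
    "Reply with 'Score: <integer>'."
)


def parse_score(text: str, lo: int = 1, hi: int = 5) -> Optional[int]:
    """Extract 'Score: N' from the judge reply; reject missing or out-of-range values."""
    match = re.search(r"Score:\s*(-?\d+)", text)
    if match is None:
        return None
    score = int(match.group(1))
    return score if lo <= score <= hi else None


def direct_score(
    judge: Callable[[str], str],
    response: str,
    criterion: str,
    lo: int = 1,
    hi: int = 5,
    samples: int = 3,
) -> Optional[float]:
    """Query the judge `samples` times and return the median of the valid scores."""
    prompt = RUBRIC_PROMPT.format(criterion=criterion, lo=lo, hi=hi, response=response)
    scores = []
    for _ in range(samples):
        score = parse_score(judge(prompt), lo, hi)
        if score is not None:
            scores.append(score)
    # None signals that no sample produced a usable score.
    return median(scores) if scores else None
```

In use, `judge` would wrap whatever model client is available; out-of-range or unparseable replies are dropped rather than clamped, so a bad reply cannot silently skew the result.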
Pairwise Comparison
An LLM compares two responses and selects the better one.
- Best for: Subjective preferences (tone, style, persuasiveness)
- Reliability: Higher than direct scoring for preferences
- Failure mode: Position bias (favoring a response because of where it appears in the prompt), length bias (favoring the longer response)
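One common way to reduce position bias is to run the comparison twice with the order swapped and accept a winner only when both runs agree. The sketch below assumes the judge is a callable that takes a prompt string and returns `A` or `B`; `PAIR_PROMPT` and `pairwise_compare` are hypothetical names introduced for this example. It does not address length bias, which needs separate handling.

```python
from typing import Callable, Optional

# Hypothetical comparison prompt; the wording and reply format are assumptions.
PAIR_PROMPT = (
    "Compare the two responses on {criterion}.\n"
    "Response A:\n{first}\n\n"
    "Response B:\n{second}\n"
    "Reply with exactly 'A' or 'B'."
)


def pairwise_compare(
    judge: Callable[[str], str],
    response_1: str,
    response_2: str,
    criterion: str,
) -> str:
    """Return 'first', 'second', or 'tie' after judging both presentation orders."""

    def ask(first: str, second: str) -> Optional[str]:
        prompt = PAIR_PROMPT.format(criterion=criterion, first=first, second=second)
        verdict = judge(prompt).strip().upper()
        return verdict if verdict in ("A", "B") else None

    forward = ask(response_1, response_2)   # response_1 shown in slot A
    backward = ask(response_2, response_1)  # response_1 shown in slot B

    # A winner counts only if it is chosen in both slots.
    if forward == "A" and backward == "B":
        return "first"
    if forward == "B" and backward == "A":
        return "second"
    # Disagreement or an unparseable reply is treated as a tie.
    return "tie"
```

A judge that always answers `A` regardless of content yields `tie` here, which is the intended behavior: a verdict that flips with the ordering carries no preference signal.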