Evaluation Skill
Evaluate LLM outputs systematically with rubrics, handle non-determinism, and implement LLM-as-judge patterns.
Core Insight: The 95% Variance Finding
Research shows 95% of output variance comes from just two sources:
- 80% from prompt tokens (wording, structure, examples)
- 15% from random seed/sampling
Temperature, model version, and all other factors together account for the remaining 5%.
Implication: Focus evaluation on prompt quality rather than on tweaking temperature or model version.
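To check how this split looks on your own task, one option is to score the same task under several prompt variants and several seeds, then decompose the score variance. The sketch below is a minimal illustration, not the method used in the cited research. It assumes each output has already been reduced to a single numeric score and that every prompt has the same number of seeded runs. The function name `variance_shares` and the sample scores are hypothetical. It separates only prompt and seed effects, so it says nothing about the remaining factors.

```python
from statistics import fmean, pvariance


def variance_shares(scores_by_prompt):
    """Split total score variance into between-prompt and within-prompt (seed) shares.

    scores_by_prompt maps a prompt label to the scores from repeated seeded runs.
    With equal group sizes, total variance = between + within (law of total variance).
    """
    groups = list(scores_by_prompt.values())
    runs = len(groups[0])
    if any(len(g) != runs for g in groups):
        raise ValueError("each prompt needs the same number of seeded runs")

    all_scores = [s for g in groups for s in g]
    total = pvariance(all_scores)
    if total == 0:
        # No variation at all, so there is nothing to attribute.
        return {"prompt": 0.0, "seed": 0.0}

    # Variance of the per-prompt means: how much scores move when the prompt changes.
    between = pvariance([fmean(g) for g in groups])
    # Average variance inside each prompt: how much scores move across seeds.
    within = fmean([pvariance(g) for g in groups])
    return {"prompt": between / total, "seed": within / total}


# Hypothetical scores: three prompt variants, three seeds each.
example_scores = {
    "terse": [0.50, 0.54, 0.46],
    "detailed": [0.80, 0.84, 0.76],
    "few_shot": [0.90, 0.94, 0.86],
}
shares = variance_shares(example_scores)
print(f"prompt share: {shares['prompt']:.2f}, seed share: {shares['seed']:.2f}")
```

If the prompt share dominates on your data, that supports spending evaluation effort on prompt variants first. If the seed share is large, plan for multiple runs per prompt before comparing them.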