evaluation

Evaluation Skill

Evaluate LLM outputs systematically with rubrics, handle non-determinism, and implement LLM-as-judge patterns.
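As an illustration of the rubric and LLM-as-judge ideas named above, here is a minimal sketch. It is not taken from the skill itself: the `Criterion` type, the 1 to 5 scale, the prompt wording, and the `judge` callable (any function that sends a prompt to a model and returns its text reply) are all assumptions made for the example.

```python
import json
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Criterion:
    """One rubric line (hypothetical structure, for illustration only)."""
    name: str
    description: str
    weight: float


def build_judge_prompt(criteria: List[Criterion], task: str, output: str) -> str:
    """Render the rubric, the task, and the candidate response into one judge prompt."""
    lines = [f"- {c.name}: {c.description}" for c in criteria]
    return (
        "Score the response on each criterion from 1 to 5.\n"
        "Reply with a JSON object mapping criterion name to an integer score.\n\n"
        "Criteria:\n" + "\n".join(lines) +
        f"\n\nTask:\n{task}\n\nResponse:\n{output}\n"
    )


def judge_output(judge: Callable[[str], str], criteria: List[Criterion],
                 task: str, output: str) -> float:
    """Ask the judge for per-criterion scores and return their weighted mean.

    `judge` is an assumed callable wrapping whatever model client you use.
    """
    raw = judge(build_judge_prompt(criteria, task, output))
    scores = json.loads(raw)  # raises if the judge did not return valid JSON
    total_weight = sum(c.weight for c in criteria)
    weighted = sum(c.weight * float(scores[c.name]) for c in criteria)
    return weighted / total_weight
```

Passing the judge in as a plain callable keeps the scoring logic testable with a stub. Because a real judge is itself non-deterministic, you may want to call it several times and average the result.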

Core Insight: The 95% Variance Finding

Research shows 95% of output variance comes from just two sources:

  • 80% from prompt tokens (wording, structure, examples)
  • 15% from random seed/sampling

Temperature, model version, and all other factors together account for only the remaining 5%.

Implication: Focus evaluation effort on prompt quality rather than on tuning model settings.
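One way to check a split like this on your own task is to score several prompt variants across several seeds, then divide the score variance into a between-prompt part and a within-prompt (seed) part. The sketch below does that under stated assumptions: `generate(prompt, seed)` and `score(output)` are hypothetical callables you would supply, and `toy_generate` / `toy_score` are stand-ins so the example runs without a model. It illustrates the method only and does not reproduce the 95% figure.

```python
import random
import statistics
from typing import Callable, Dict, List


def score_runs(generate: Callable[[str, int], str],
               score: Callable[[str], float],
               prompts: List[str],
               seeds: List[int]) -> Dict[str, List[float]]:
    """Score every prompt variant once per seed."""
    return {p: [score(generate(p, s)) for s in seeds] for p in prompts}


def variance_split(scores: Dict[str, List[float]]) -> Dict[str, float]:
    """Return the share of total score variance due to prompt vs. seed.

    Uses the law of total variance: total = between-prompt + within-prompt.
    """
    all_scores = [s for runs in scores.values() for s in runs]
    n = len(all_scores)
    grand_mean = statistics.fmean(all_scores)
    between = sum(
        len(runs) * (statistics.fmean(runs) - grand_mean) ** 2
        for runs in scores.values()
    ) / n
    within = sum(
        (s - statistics.fmean(runs)) ** 2
        for runs in scores.values()
        for s in runs
    ) / n
    total = between + within
    if total == 0:
        return {"prompt": 0.0, "seed": 0.0}  # every score identical
    return {"prompt": between / total, "seed": within / total}


def toy_generate(prompt: str, seed: int) -> str:
    """Stand-in for a model call: appends a seed-dependent number of words."""
    rng = random.Random(seed)
    return prompt + " ok" * rng.randint(1, 3)


def toy_score(output: str) -> float:
    """Stand-in for a rubric score: simply counts words."""
    return float(len(output.split()))


if __name__ == "__main__":
    runs = score_runs(toy_generate, toy_score,
                      ["short prompt", "a much longer prompt variant"],
                      seeds=[0, 1, 2, 3])
    print(variance_split(runs))
```

If the prompt share dominates on your task, iterating on prompt wording is likely the better use of evaluation time; if the seed share is large, average over more samples before comparing prompts.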

What's Included

Installs: 1
GitHub Stars: 28
First Seen: May 2, 2026
evaluation — greyhaven-ai/claude-code-config