Evaluation Skill

Evaluate LLM outputs systematically with rubrics, handle non-determinism, and implement LLM-as-judge patterns.

Core Insight: The 95% Variance Finding

Research shows 95% of output variance comes from just two sources:

  • 80% from prompt tokens (wording, structure, examples)
  • 15% from random seed/sampling

Temperature, model version, and all other factors together account for only the remaining 5%.

Implication: Focus evaluation on prompt quality, not model tweaking.
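
One way the rubric and non-determinism ideas above fit together can be sketched as follows. The `judge` function is a hypothetical stand-in for a real LLM-as-judge call (stubbed deterministically here for illustration); in practice it would prompt a model with the criterion and the output. Because judge calls are themselves non-deterministic, each criterion is scored several times and the scores are averaged:

```python
from statistics import mean

# Hypothetical rubric: criterion name -> description shown to the judge model.
RUBRIC = {
    "accuracy": "Claims are factually correct",
    "clarity": "Explanation is easy to follow",
}

def judge(output: str, criterion: str) -> int:
    """Hypothetical LLM-as-judge call returning a 1-5 score.
    In practice this would send the criterion and output to a model;
    here it is stubbed deterministically so the sketch runs as-is."""
    return 4 if criterion == "accuracy" else 3

def evaluate(output: str, n_samples: int = 5) -> dict:
    """Score each rubric criterion n_samples times and average the
    results, smoothing over the judge's own sampling variance."""
    return {
        criterion: mean(judge(output, criterion) for _ in range(n_samples))
        for criterion in RUBRIC
    }

scores = evaluate("Example model output")
print(scores)  # {'accuracy': 4, 'clarity': 3}
```

The names `RUBRIC`, `judge`, and `evaluate` are illustrative, not part of the skill's actual API; the repeated-sampling pattern is what matters.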

First Seen: Mar 20, 2026