Evaluation Skill

Evaluate LLM outputs systematically with rubrics, handle non-determinism, and implement LLM-as-judge patterns.
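As a rough illustration of how rubric scoring and repeated sampling can fit together, here is a minimal Python sketch. The rubric criteria, their weights, and the `generate` and `judge` callables are hypothetical placeholders, not part of the skill itself; `judge` stands in for whatever LLM-as-judge call returns a 0-1 score per criterion.

```python
from statistics import mean, stdev
from typing import Callable, Dict

# Hypothetical rubric (an assumption for illustration): criterion -> weight.
# The weights are chosen to sum to 1.0.
RUBRIC: Dict[str, float] = {"accuracy": 0.5, "completeness": 0.3, "clarity": 0.2}


def weighted_score(scores: Dict[str, float],
                   rubric: Dict[str, float] = RUBRIC) -> float:
    """Combine per-criterion scores (each in 0-1) into one weighted score."""
    return sum(weight * scores[name] for name, weight in rubric.items())


def evaluate_repeated(generate: Callable[[str], str],
                      judge: Callable[[str], Dict[str, float]],
                      prompt: str,
                      runs: int = 5) -> Dict[str, float]:
    """Score several sampled outputs for one prompt and summarize the spread.

    `generate` and `judge` are placeholders for a model call and an
    LLM-as-judge call; neither refers to a real library API.
    """
    totals = [weighted_score(judge(generate(prompt))) for _ in range(runs)]
    spread = stdev(totals) if runs > 1 else 0.0
    return {"mean": mean(totals), "stdev": spread, "runs": float(runs)}
```

Reporting the mean together with the standard deviation over several runs, rather than a single score, is one simple way to keep sampling noise from being mistaken for a real quality difference.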

Core Insight: The 95% Variance Finding

Research shows 95% of output variance comes from just two sources:

80% from prompt tokens (wording, structure, examples)
15% from random seed/sampling

Temperature, model version, and all other factors together account for only the remaining 5%.

Implication: Focus evaluation effort on prompt quality rather than on tuning model settings.
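To check how a split like this looks on your own data, one option is a simple two-way decomposition of score variance into a between-prompt share and a within-prompt (sampling) share. The sketch below is an assumption about how such a measurement could be done, not the method behind the figures above; the function name and input layout are hypothetical, and it assumes every prompt variant has the same number of sampled runs.

```python
from statistics import mean, pvariance
from typing import Dict, List


def variance_shares(scores_by_prompt: Dict[str, List[float]]) -> Dict[str, float]:
    """Split total score variance into prompt and sampling shares.

    scores_by_prompt maps each prompt variant to the scores of its repeated
    sampled runs. With equal run counts per variant, the population variance
    of all scores equals the variance of the per-prompt means plus the
    average within-prompt variance (law of total variance).
    """
    groups = list(scores_by_prompt.values())
    all_scores = [score for group in groups for score in group]
    total = pvariance(all_scores)
    if total == 0:
        # No variation at all, so there is nothing to attribute.
        return {"prompt": 0.0, "sampling": 0.0}
    between = pvariance([mean(group) for group in groups])
    within = mean(pvariance(group) for group in groups)
    return {"prompt": between / total, "sampling": within / total}


# Illustrative scores only, not measured results.
shares = variance_shares({
    "prompt_a": [0.8, 0.9],
    "prompt_b": [0.4, 0.5],
})
```

In this toy input the prompt share dominates because the two variants differ far more from each other than repeated runs of the same variant do. Factors such as temperature or model version are not modeled here; they would need their own grouping dimension.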

What's Included

Installs: 3

Source: smithery.ai/ski…aluation

First Seen: Mar 20, 2026