advanced-evaluation

Advanced Evaluation

Production-grade techniques for evaluating LLM outputs using LLMs as judges.

Evaluation Taxonomy

Direct Scoring

A single LLM judge rates one response on a defined scale.

  • Best for: Objective criteria (factual accuracy, instruction following)
  • Reliability: Moderate to high for well-defined criteria
  • Failure mode: Score calibration drift (the same score stops meaning the same quality level across runs or prompts)
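
A minimal Python sketch of direct scoring, under stated assumptions: the rubric wording, the 1-5 scale, the `Score: <n>` reply format, and the `judge` callable (any function that takes a prompt string and returns the model's text) are all hypothetical and not taken from this skill. The stub `fake_judge` stands in for a real LLM call so the example runs offline.

```python
import re
from typing import Callable

# Hypothetical rubric; the wording and the 1-5 scale are illustrative assumptions.
RUBRIC = (
    "Rate the response for factual accuracy on a 1-5 scale.\n"
    "1 = mostly incorrect, 3 = partly correct, 5 = fully correct.\n"
    "Reply with a line of the form 'Score: <n>'."
)


def build_scoring_prompt(question: str, response: str) -> str:
    """Combine the rubric, the question, and the response into one judge prompt."""
    return f"{RUBRIC}\n\nQuestion: {question}\nResponse: {response}"


def parse_score(judge_output: str, low: int = 1, high: int = 5) -> int:
    """Extract the integer after 'Score:' and reject values outside the scale."""
    match = re.search(r"Score:\s*(\d+)", judge_output)
    if match is None:
        raise ValueError("judge output has no 'Score: <n>' line")
    score = int(match.group(1))
    if not low <= score <= high:
        raise ValueError(f"score {score} is outside the {low}-{high} scale")
    return score


def direct_score(judge: Callable[[str], str], question: str, response: str) -> int:
    """Ask the judge to rate a single response and return the parsed score."""
    return parse_score(judge(build_scoring_prompt(question, response)))


def fake_judge(prompt: str) -> str:
    """Stub standing in for a real LLM call; always returns the same verdict."""
    return "The answer matches the reference.\nScore: 5"


score = direct_score(fake_judge, "What is 2 + 2?", "4")
```

Rejecting unparseable or out-of-range output, instead of silently defaulting to a score, keeps malformed judge replies visible. Spelling out what each scale point means in the rubric is one plausible way to limit calibration drift, though the sketch does not measure it.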

Pairwise Comparison

An LLM judge compares two responses to the same prompt and selects the better one.

  • Best for: Subjective preferences (tone, style, persuasiveness)
  • Reliability: Higher than direct scoring for preferences
  • Failure mode: Position bias (favoring a response because of the slot it appears in), length bias (favoring the longer response)
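
A minimal Python sketch of pairwise comparison with a position swap, one common way to detect position bias: the judge is asked twice with the two responses in opposite slots, and a winner is declared only when both passes agree. The prompt wording, the `A`/`B` reply format, the `"tie"` fallback, and the `judge` callable are hypothetical assumptions, not taken from this skill. The swap does nothing for length bias, which would need a separate check.

```python
from typing import Callable


def build_pairwise_prompt(question: str, first: str, second: str) -> str:
    """Build a judge prompt with `first` in slot A and `second` in slot B."""
    return (
        "Which response answers the question better? "
        "Answer with exactly one letter: A or B.\n\n"
        f"Question: {question}\n\n"
        f"Response A: {first}\n\n"
        f"Response B: {second}"
    )


def compare_once(judge: Callable[[str], str], question: str, first: str, second: str) -> str:
    """Run one comparison and return the judge's verdict, 'A' or 'B'."""
    verdict = judge(build_pairwise_prompt(question, first, second)).strip().upper()
    if verdict not in {"A", "B"}:
        raise ValueError(f"unexpected verdict: {verdict!r}")
    return verdict


def pairwise_compare(judge: Callable[[str], str], question: str, resp_1: str, resp_2: str) -> str:
    """Compare in both orders; return 'response_1', 'response_2', or 'tie'."""
    # Pass 1: resp_1 sits in slot A.
    first_pass = "response_1" if compare_once(judge, question, resp_1, resp_2) == "A" else "response_2"
    # Pass 2: slots swapped, so resp_2 sits in slot A.
    second_pass = "response_2" if compare_once(judge, question, resp_2, resp_1) == "A" else "response_1"
    # Disagreement means the verdict followed the slot, not the content.
    return first_pass if first_pass == second_pass else "tie"


def content_judge(prompt: str) -> str:
    """Stub judge that prefers whichever slot mentions 'Paris'."""
    slot_a = prompt.split("Response A:")[1].split("Response B:")[0]
    return "A" if "Paris" in slot_a else "B"


def position_biased_judge(prompt: str) -> str:
    """Stub judge that always picks slot A regardless of content."""
    return "A"


question = "What is the capital of France?"
consistent = pairwise_compare(content_judge, question, "Paris is the capital.", "Lyon is the capital.")
biased = pairwise_compare(position_biased_judge, question, "Paris is the capital.", "Lyon is the capital.")
```

With the stubs above, the content-based judge yields the same winner in both orders, while the always-A judge disagrees with itself and falls back to `"tie"`. The cost is two judge calls per comparison.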
advanced-evaluation — 5dlabs/cto-agents