advanced-evaluation

Advanced Evaluation

This skill covers production-grade techniques for evaluating LLM outputs with LLMs as judges. It distills academic research, industry practice, and hands-on implementation experience into actionable patterns for building reliable evaluation systems.

Key insight: LLM-as-judge is not a single technique but a family of approaches, each suited to a different evaluation context. This skill develops two core competencies: choosing the right approach for the context and mitigating the known biases of LLM judges.
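
As one illustration of bias mitigation, the sketch below handles position bias, a commonly reported tendency of LLM judges to favor a response because of where it appears in the prompt. It is a minimal sketch, not this skill's prescribed method: the `Judge` signature and the `compare_with_swap` helper are hypothetical names introduced here, and the judge is assumed to return "first", "second", or "tie". Each pair is judged in both orders, and a winner is declared only when the verdict survives the swap.

```python
from typing import Callable

# Hypothetical judge signature (an assumption for this sketch): it receives
# (prompt, first_response, second_response) and returns "first", "second",
# or "tie". In practice it would wrap a call to an LLM.
Judge = Callable[[str, str, str], str]


def compare_with_swap(judge: Judge, prompt: str, a: str, b: str) -> str:
    """Return "A", "B", or "tie" after judging both presentation orders."""
    forward = judge(prompt, a, b)   # response A shown first
    backward = judge(prompt, b, a)  # response B shown first

    # Map positional verdicts back to the underlying responses.
    forward_pick = {"first": "A", "second": "B"}.get(forward, "tie")
    backward_pick = {"first": "B", "second": "A"}.get(backward, "tie")

    # A verdict counts only if it is the same in both orders; otherwise
    # the disagreement is treated as a tie.
    return forward_pick if forward_pick == backward_pick else "tie"
```

A judge that always prefers whichever response it sees first yields "tie" under this scheme, so its positional preference is not mistaken for a quality signal.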

When to Activate

Activate this skill when:

  • Building LLM-as-judge systems to evaluate LLM outputs
  • Comparing multiple model responses to select the best one
  • Establishing consistent quality standards across evaluation teams
  • Debugging evaluation systems that show inconsistent results
  • Designing A/B tests for prompt or model changes
  • Creating rubrics specifically for LLM or human/LLM hybrid judges
  • Analyzing correlation between automated and human judgments
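
For the last item in the list above, one common way to quantify agreement between automated and human judgments is Spearman rank correlation. The following is a minimal, standard-library sketch under stated assumptions: both score lists are aligned so that index i refers to the same response, and the helper names `average_ranks` and `spearman` are introduced here for illustration only.

```python
from math import sqrt
from typing import Sequence


def average_ranks(values: Sequence[float]) -> list[float]:
    """Assign 1-based ranks, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of values equal to values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(judge_scores: Sequence[float], human_scores: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation computed on the ranks."""
    if len(judge_scores) != len(human_scores) or len(judge_scores) < 2:
        raise ValueError("need two aligned score lists with at least 2 items")
    rx, ry = average_ranks(judge_scores), average_ranks(human_scores)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((x - mx) * (y - my) for x, y in zip(rx, ry))
    var_x = sum((x - mx) ** 2 for x in rx)
    var_y = sum((y - my) ** 2 for y in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one list is constant")
    return cov / sqrt(var_x * var_y)
```

A rank-based measure fits here because judge and human scores often sit on different scales; it only asks whether the two sources order the responses the same way.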

Do not activate this skill for adjacent work owned by other skills:

First Seen
Jan 20, 2026
advanced-evaluation — shipshitdev/library