
Originally from shipshitdev/library

Advanced Evaluation

This skill covers production-grade techniques for evaluating LLM outputs with LLMs as judges. It synthesizes academic research, industry practice, and hands-on implementation experience into actionable patterns for building reliable evaluation systems.

Key insight: LLM-as-a-Judge is not a single technique but a family of approaches, each suited to different evaluation contexts. Choosing the right approach and mitigating known biases are the core competencies this skill develops.

When to Use

Activate this skill when:

  • Building automated evaluation pipelines for LLM outputs
  • Comparing multiple model responses to select the best one
  • Establishing consistent quality standards across evaluation teams
  • Debugging evaluation systems that show inconsistent results
  • Designing A/B tests for prompt or model changes
  • Creating rubrics for human or automated evaluation
  • Analyzing correlation between automated and human judgments
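For the last item, one common way to quantify agreement between automated and human judgments is a rank correlation. The sketch below computes Spearman's rank correlation using only the standard library. The function names and the sample scores are hypothetical and chosen for illustration; the skill itself does not prescribe a particular metric.

```python
from statistics import mean


def average_ranks(values):
    """Rank values from 1..n; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j across the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of the 1-based positions i+1..j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one series is constant")
    return cov / (var_x * var_y) ** 0.5


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length, at least 2 items")
    return pearson(average_ranks(xs), average_ranks(ys))


# Hypothetical data: human ratings and LLM-judge ratings for five responses.
human_scores = [5, 3, 4, 1, 2]
judge_scores = [4, 3, 5, 1, 2]
print(round(spearman(human_scores, judge_scores), 2))
```

A rank-based measure is used here because judge and human scores often sit on different scales; only the ordering of responses has to match for the correlation to be high.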

Core Concepts

First Seen
Mar 12, 2026