advanced-evaluation
Advanced Evaluation
This skill covers production-grade techniques for evaluating LLM outputs using LLMs as judges. It synthesizes research from academic papers, industry practices, and practical implementation experience into actionable patterns for building reliable evaluation systems.
Key insight: LLM-as-a-Judge is not a single technique but a family of approaches, each suited to different evaluation contexts. Choosing the right approach and mitigating known biases is the core competency this skill develops.
When to Use
Activate this skill when:
- Building automated evaluation pipelines for LLM outputs
- Comparing multiple model responses to select the best one
- Establishing consistent quality standards across evaluation teams
- Debugging evaluation systems that show inconsistent results
- Designing A/B tests for prompt or model changes
- Creating rubrics for human or automated evaluation
- Analyzing correlation between automated and human judgments
Core Concepts
More from ranbot-ai/awesome-skills
web-games
Web browser game development principles. Framework selection, WebGPU, optimization, PWA.
68angular
Modern Angular (v20+) expert with deep knowledge of Signals, Standalone Components, Zoneless applications, SSR/Hydration, and reactive patterns.
28-21risk-automation
Automate 21risk tasks via Rube MCP (Composio). Always search tools first for current schemas.
2quaderno-automation
Automate Quaderno tasks via Rube MCP (Composio). Always search tools first for current schemas.
1zapier-make-patterns
No-code automation democratizes workflow building. Zapier and Make (formerly Integromat) let non-developers automate business processes without writing code. But no-code doesn't mean no-complexity ...
1azure-maps-search-dotnet
Azure Maps SDK for .NET. Location-based services including geocoding, routing, rendering, geolocation, and weather. Use for address search, directions, map tiles, IP geolocation, and weather data.
1