advanced-evaluation
Advanced Evaluation
This skill covers production-grade techniques for evaluating LLM outputs using LLMs as judges. It synthesizes academic research, industry practice, and hands-on implementation experience into actionable patterns for building reliable evaluation systems.
Key insight: LLM-as-a-Judge is not a single technique but a family of approaches, each suited to a different evaluation context. The core competency this skill develops is choosing the right approach for the task and mitigating its known biases.
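For instance, pairwise comparison is one member of that family: the judge is shown two candidate responses and asked which better answers the question. Because judges tend to favor whichever answer appears first (position bias), a common mitigation is to run the comparison twice with the order swapped and only accept verdicts that agree. The sketch below assumes a hypothetical `call_llm(prompt) -> str` helper standing in for whatever judge model and client you use; it illustrates the pattern, not this skill's reference implementation.

```python
def judge_pairwise(question: str, answer_a: str, answer_b: str, call_llm) -> str:
    """Ask the judge which answer is better; returns 'A', 'B', or 'tie'.

    `call_llm` is a hypothetical helper: prompt in, completion text out.
    """
    prompt = (
        "You are comparing two answers to the same question.\n"
        f"Question:\n{question}\n\n"
        f"Answer A:\n{answer_a}\n\n"
        f"Answer B:\n{answer_b}\n\n"
        "Reply with exactly one token: A, B, or tie."
    )
    verdict = call_llm(prompt).strip().upper()
    return verdict if verdict in {"A", "B"} else "tie"


def judge_pairwise_debiased(question: str, answer_a: str, answer_b: str, call_llm) -> str:
    """Mitigate position bias: judge twice with the candidate order swapped.

    Keep a verdict only when both orderings agree; otherwise call it a tie.
    """
    first = judge_pairwise(question, answer_a, answer_b, call_llm)
    second = judge_pairwise(question, answer_b, answer_a, call_llm)
    # In the swapped run, 'A' refers to answer_b, so invert before comparing.
    second_mapped = {"A": "B", "B": "A", "tie": "tie"}[second]
    return first if first == second_mapped else "tie"
```

Other members of the family (direct rubric scoring, reference-guided grading, ranking over many candidates) follow the same shape but change what the judge is shown and what it must return.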
When to Activate
Activate this skill when:
- Building automated evaluation pipelines for LLM outputs
- Comparing multiple model responses to select the best one
- Establishing consistent quality standards across evaluation teams
- Debugging evaluation systems that show inconsistent results
- Designing A/B tests for prompt or model changes
- Creating rubrics for human or automated evaluation
- Analyzing correlation between automated and human judgments (see the sketch after this list)
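For the last case, a quick way to sanity-check an automated judge is to correlate its scores with human ratings on the same items. A minimal sketch, assuming you already have paired score lists and are willing to use scipy; Spearman rank correlation is a common choice because judge scores are ordinal:

```python
from scipy.stats import spearmanr

def judge_human_agreement(judge_scores: list[float], human_scores: list[float]) -> float:
    """Spearman rank correlation between judge and human scores on the same items.

    Values near 1.0 mean the judge ranks items the way humans do; values near 0
    mean the automated scores carry little signal about human preference.
    """
    rho, p_value = spearmanr(judge_scores, human_scores)
    print(f"Spearman rho = {rho:.2f} (p = {p_value:.3g})")
    return rho

# Example with made-up scores for five items:
judge_human_agreement([4, 2, 5, 3, 1], [5, 2, 4, 3, 1])
```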
Core Concepts