eval-designer
Eval Designer
Overview
This skill covers the end-to-end design of evaluation frameworks for LLM-powered systems. It helps teams define what "good" looks like for their specific use case, build diverse test suites that cover both capabilities and failure modes, and design human evaluation rubrics with clear scoring criteria. It also covers implementing automated eval pipelines using reference-based and LLM-as-judge approaches, and tracking quality over time as models and prompts change. A robust eval framework is the engineering foundation that lets a team make model upgrades, prompt changes, and feature launches with confidence.
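To make the reference-based approach concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not part of the skill itself: `EvalCase`, `exact_match`, and `run_eval` are hypothetical names, and the system under test is assumed to be any callable that maps a prompt string to an output string.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class EvalCase:
    """One test case: an input prompt and the expected reference answer."""
    prompt: str
    reference: str


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so formatting noise is not scored as an error."""
    return " ".join(text.lower().split())


def exact_match(output: str, reference: str) -> float:
    """Reference-based scorer: 1.0 if the normalized strings agree, else 0.0."""
    return 1.0 if normalize(output) == normalize(reference) else 0.0


def run_eval(
    cases: List[EvalCase],
    system: Callable[[str], str],
    scorer: Callable[[str, str], float] = exact_match,
) -> float:
    """Run every case through the system and return the mean score."""
    if not cases:
        raise ValueError("cases must not be empty")
    scores = [scorer(system(case.prompt), case.reference) for case in cases]
    return sum(scores) / len(scores)
```

Because the scorer is just a function of `(output, reference)`, an LLM-as-judge scorer could in principle be passed in place of `exact_match` for tasks without a single correct answer; how that judge is prompted and validated is outside this sketch.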
When to Use
- Building an eval suite before deploying an LLM-powered feature for the first time
- Designing automated evals to run in CI/CD pipelines for prompt or model changes
- Creating human evaluation rubrics with scoring guidelines for labeler studies
- Defining safety evals to test for harmful outputs, jailbreaks, or policy violations
- Measuring quality regression after a model upgrade (e.g., GPT-4 → GPT-4o)
- Setting up LLM-as-judge evaluation for tasks without clear ground truth
- Establishing baseline metrics before A/B testing different prompts or models
- Auditing an existing eval suite for coverage gaps or measurement validity
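Several of the scenarios above (CI/CD evals, regression checks after a model upgrade, baselines before A/B tests) reduce to comparing a candidate's scores against a baseline on the same test cases. A minimal sketch of such a gate follows. The function name `regression_gate` and the default `max_drop` tolerance of 0.02 are assumptions chosen for illustration, not values prescribed by this skill.

```python
from statistics import mean
from typing import Dict, Sequence


def regression_gate(
    baseline: Sequence[float],
    candidate: Sequence[float],
    max_drop: float = 0.02,  # hypothetical tolerance; tune per use case
) -> Dict[str, object]:
    """Compare per-case scores from a baseline and a candidate run.

    Both sequences must be scored on the same cases in the same order.
    The gate passes when the mean score drops by no more than max_drop.
    """
    if not baseline or len(baseline) != len(candidate):
        raise ValueError("baseline and candidate must be non-empty and equal length")

    baseline_mean = mean(baseline)
    candidate_mean = mean(candidate)

    # Indices of cases that got worse, useful for manual review even when the gate passes.
    regressed = [i for i, (b, c) in enumerate(zip(baseline, candidate)) if c < b]

    return {
        "baseline_mean": baseline_mean,
        "candidate_mean": candidate_mean,
        "passed": (baseline_mean - candidate_mean) <= max_drop,
        "regressed_cases": regressed,
    }
```

In a CI job, the `passed` field could decide whether the build fails, while `regressed_cases` points reviewers at the specific inputs that got worse. A mean-difference threshold is a deliberately simple choice; with small test suites, noise can exceed the tolerance, so a statistical test or repeated runs may be more appropriate.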
When NOT to Use
- Training or fine-tuning models (use model training skills)
- Collecting and curating datasets for training (use dataset-curator skill)
- Comparing publicly available model benchmarks like MMLU or HumanEval (use model-comparator skill)