eval-designer

Eval Designer

Overview

This skill covers the end-to-end design of evaluation frameworks for LLM-powered systems. It helps teams define what "good" looks like for their specific use case, build diverse test suites that cover both capabilities and failure modes, design human evaluation rubrics with clear scoring criteria, implement automated eval pipelines using reference-based and LLM-as-a-judge approaches, and track quality over time as models and prompts change. A robust eval framework is what lets a team upgrade models, change prompts, and launch features with confidence.
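As a rough illustration of the reference-based side of such a pipeline, the sketch below runs a tiny test suite through a system under test and averages a per-case score. Everything named here (`EvalCase`, `exact_match`, `run_suite`, `fake_system`, and the sample cases) is hypothetical and not part of the skill itself; a real setup would call an actual model and would likely use a richer scorer, or an LLM judge, in place of exact match.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class EvalCase:
    """One test case: a prompt plus the reference answer we expect."""
    prompt: str
    reference: str
    tag: str = "capability"  # hypothetical label, e.g. "capability" or "failure_mode"


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial differences do not fail a case."""
    return " ".join(text.lower().split())


def exact_match(output: str, reference: str) -> float:
    """Reference-based scorer: 1.0 if the normalized strings are equal, else 0.0."""
    return 1.0 if normalize(output) == normalize(reference) else 0.0


def run_suite(
    cases: List[EvalCase],
    system: Callable[[str], str],
    scorer: Callable[[str, str], float],
) -> float:
    """Run every case through the system and return the mean score."""
    if not cases:
        raise ValueError("eval suite is empty")
    scores = [scorer(system(case.prompt), case.reference) for case in cases]
    return sum(scores) / len(scores)


def fake_system(prompt: str) -> str:
    """Stand-in for a real LLM call (assumption: canned answers for the demo)."""
    answers = {"Capital of France?": "  Paris ", "2 + 2 = ?": "5"}
    return answers.get(prompt, "")


SUITE = [
    EvalCase("Capital of France?", "paris"),
    EvalCase("2 + 2 = ?", "4", tag="failure_mode"),
]

if __name__ == "__main__":
    print(run_suite(SUITE, fake_system, exact_match))
```

Because the scorer is passed in as a plain function, the same loop could accept a judge-based scorer for tasks without clear ground truth; that swap is a design assumption of this sketch rather than something the skill prescribes.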

When to Use

Building an eval suite before deploying an LLM-powered feature for the first time
Designing automated evals to run in CI/CD pipelines for prompt or model changes
Creating human evaluation rubrics with scoring guidelines for labeler studies
Defining safety evals to test for harmful outputs, jailbreaks, or policy violations
Measuring quality regression after a model upgrade (e.g., GPT-4 → GPT-4o)
Setting up LLM-as-a-judge evaluation for tasks without clear ground truth
Establishing baseline metrics before A/B testing different prompts or models
Auditing an existing eval suite for coverage gaps or measurement validity
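Several of these scenarios (CI/CD checks, model upgrades, pre-A/B baselines) come down to comparing a candidate's scores against a stored baseline. The following is a minimal sketch of such a regression gate, assuming per-category scores in the range 0 to 1; the function name `find_regressions`, the category names, and the 0.02 tolerance are illustrative assumptions, not values taken from the skill.

```python
from typing import Dict, List


def find_regressions(
    baseline: Dict[str, float],
    candidate: Dict[str, float],
    tolerance: float = 0.02,  # hypothetical allowed drop before a category is flagged
) -> List[str]:
    """Return the categories whose candidate score fell more than `tolerance` below baseline."""
    regressed = []
    for category, base_score in baseline.items():
        # A category missing from the candidate run is treated as a score of 0.0.
        new_score = candidate.get(category, 0.0)
        if new_score < base_score - tolerance:
            regressed.append(category)
    return sorted(regressed)


if __name__ == "__main__":
    baseline_scores = {"accuracy": 0.90, "safety": 0.99, "format": 0.80}
    candidate_scores = {"accuracy": 0.89, "safety": 0.95, "format": 0.85}
    failures = find_regressions(baseline_scores, candidate_scores)
    # A CI job could exit non-zero when this list is not empty.
    print(failures)
```

Gating per category rather than on one aggregate number keeps a safety regression from being hidden by gains elsewhere; the right tolerance depends on suite size and score noise, so the default here is only a placeholder.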

When NOT to Use

Training or fine-tuning models (use model training skills)
Collecting and curating datasets for training (use dataset-curator skill)
Comparing publicly available model benchmarks like MMLU or HumanEval (use model-comparator skill)

Repository

nickcrew/claude-ctx-plugin

First Seen

Apr 13, 2026

Security Audits

Gen Agent Trust Hub: Pass
Socket: Pass
Snyk: Pass
