judgment-eval


Judgment Evaluation Skill

Priorities

Realism (scenarios must be plausible) > Diagnostic Value (reveals actual judgment gaps) > Coverage (test multiple dimensions)

Reasoning: Unrealistic scenarios produce false signals. Diagnostic value ensures we learn from failures. Coverage prevents overfitting to a single dimension.
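One way to read the priority ordering above is as a lexicographic comparison: realism decides first, diagnostic value breaks ties, and coverage breaks any remaining ties. The sketch below illustrates that reading only. The `Candidate` class, the 1-5 ratings, and the example scenario names are hypothetical and not part of the skill; the ratings are assumed to be assigned by a human reviewer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """A candidate scenario with hypothetical 1-5 reviewer ratings."""
    name: str
    realism: int
    diagnostic: int
    coverage: int


def rank(candidates):
    """Order candidates by realism, then diagnostic value, then coverage."""
    return sorted(
        candidates,
        key=lambda c: (c.realism, c.diagnostic, c.coverage),
        reverse=True,  # higher ratings come first
    )


# Illustrative data only; names and ratings are made up.
candidates = [
    Candidate("ambiguous refund request", realism=5, diagnostic=3, coverage=2),
    Candidate("conflicting instructions from two users", realism=5, diagnostic=4, coverage=1),
    Candidate("user speaks only in riddles", realism=1, diagnostic=5, coverage=5),
]

ranked = rank(candidates)
for c in ranked:
    print(c.name)
```

Under this reading, the riddle scenario ranks last despite its high diagnostic and coverage ratings, because realism is compared first.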

Goal

Generate scenario-based tests from an agent definition or system prompt, then guide interactive evaluation to identify judgment strengths, weaknesses, and prompt improvement opportunities.

Constraints

Interactive Evaluation Only: This skill guides manual evaluation in-conversation. Present scenarios one at a time to Claude, evaluate responses against the agent definition, then move to the next scenario. Do NOT attempt automated execution or batch processing.
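Because evaluation is manual, it can help to keep a simple record of each scenario as it is reviewed. The sketch below is a note-taking aid under stated assumptions, not an execution harness: it never calls the agent, and the `Observation` fields, the `weak_dimensions` helper, and the sample entries are all hypothetical.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Observation:
    """One manually evaluated scenario (fields are illustrative)."""
    scenario: str
    dimension: str   # the judgment dimension the scenario probes
    passed: bool     # reviewer's verdict against the agent definition
    note: str = ""


def weak_dimensions(observations):
    """Return dimensions with at least one failure, most failures first."""
    failures = Counter(o.dimension for o in observations if not o.passed)
    return [dimension for dimension, _ in failures.most_common()]


# Entries are appended one at a time, after each scenario is reviewed by hand.
log = [
    Observation("vague bug report", "clarification", passed=True),
    Observation("request outside stated role", "escalation", passed=False,
                note="answered instead of deferring"),
    Observation("urgent request with missing data", "escalation", passed=False),
    Observation("feature creep mid-task", "scope", passed=False),
]

print(weak_dimensions(log))
```

Grouping failures by dimension gives a rough view of where the prompt may need revision; the reviewer's notes remain the primary evidence.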

Scenario Realism: Every scenario must be plausible in actual usage. Avoid contrived corner cases that would never occur in practice.

Installs: 1
GitHub Stars: 1
First Seen: Mar 29, 2026