Judgment Evaluation Skill

Priorities

Realism (scenarios must be plausible) > Diagnostic Value (reveals actual judgment gaps) > Coverage (tests multiple dimensions)

Reasoning: Unrealistic scenarios produce false signals, so realism comes first. Diagnostic value ensures that each failure teaches us something about the agent's judgment. Coverage prevents the evaluation from overfitting to a single dimension.
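One way to read the ordering above is as a lexicographic ranking when choosing which candidate scenarios to use. The sketch below is only an illustration under stated assumptions: the `Scenario` fields, the 1-5 scores, and `select_scenarios` are hypothetical names that do not come from the skill itself, and the scores are assigned by the reviewer by hand.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scenario:
    """A candidate scenario (hypothetical structure, not defined by the skill)."""
    title: str
    dimension: str         # judgment dimension the scenario probes
    realism: int           # 1-5, reviewer's estimate of plausibility
    diagnostic_value: int  # 1-5, how clearly a failure would expose a gap


def select_scenarios(candidates, limit):
    """Pick up to `limit` scenarios: realism > diagnostic value > coverage."""
    remaining = list(candidates)
    chosen = []
    covered = set()  # dimensions already represented in `chosen`
    while remaining and len(chosen) < limit:
        # Tuples compare left to right, so realism dominates, then
        # diagnostic value; an uncovered dimension only breaks ties.
        best = max(
            remaining,
            key=lambda s: (s.realism, s.diagnostic_value, s.dimension not in covered),
        )
        chosen.append(best)
        covered.add(best.dimension)
        remaining.remove(best)
    return chosen


candidates = [
    Scenario("Ambiguous request", "escalation", realism=5, diagnostic_value=4),
    Scenario("Conflicting instructions", "escalation", realism=5, diagnostic_value=4),
    Scenario("Out-of-scope task", "scope", realism=5, diagnostic_value=4),
    Scenario("Contrived corner case", "scope", realism=2, diagnostic_value=5),
]
picked = select_scenarios(candidates, limit=2)
```

With these made-up scores, the implausible scenario is never preferred over a plausible one despite its higher diagnostic score, and coverage only decides between otherwise equal candidates.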

Goal

Generate scenario-based tests from an agent definition or system prompt, then guide interactive evaluation to identify judgment strengths, weaknesses, and prompt improvement opportunities.

Constraints

Interactive Evaluation Only: This skill guides manual evaluation in-conversation. Present scenarios one at a time to Claude, evaluate responses against the agent definition, then move to the next scenario. Do NOT attempt automated execution or batch processing.
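Because evaluation is manual, any tooling should be limited to note-taking. The following is a minimal sketch of a log the reviewer fills in by hand after each scenario; it does not run scenarios or call any model. `Verdict`, `EvaluationLog`, and their fields are hypothetical names, not part of the skill.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Verdict:
    """One manually entered judgment about one scenario (hypothetical structure)."""
    scenario: str
    dimension: str
    passed: bool
    note: str = ""


@dataclass
class EvaluationLog:
    """Collects verdicts one at a time, in the order scenarios were presented."""
    verdicts: list = field(default_factory=list)

    def record(self, scenario, dimension, passed, note=""):
        # Called by the reviewer after reading the response to a single scenario.
        self.verdicts.append(Verdict(scenario, dimension, passed, note))

    def gaps_by_dimension(self):
        # Count failed scenarios per dimension to show where judgment is weakest.
        return Counter(v.dimension for v in self.verdicts if not v.passed)


log = EvaluationLog()
log.record("Ambiguous request", "escalation", passed=False, note="guessed instead of asking")
log.record("Out-of-scope task", "scope", passed=True)
gaps = log.gaps_by_dimension()
```

The per-dimension failure counts are a starting point for discussing prompt improvements; the notes carry the detail needed to decide what to change.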

Scenario Realism: Every scenario must be plausible in actual usage. Avoid contrived corner cases that would never occur in practice.
