judgment-eval
Judgment Evaluation Skill
Priorities
Realism (scenarios must be plausible) > Diagnostic Value (failures reveal actual judgment gaps) > Coverage (scenarios test multiple dimensions)
Reasoning: Unrealistic scenarios produce false signals. Diagnostic value ensures we learn from failures. Coverage prevents overfitting to a single dimension.
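One way to read the priority ordering above is as a lexicographic sort: realism decides first, diagnostic value breaks ties, and coverage breaks any remaining ties. The sketch below is only an illustration for ordering candidate scenarios before they are presented manually. The `Scenario` class, the 1-to-5 scores, and `rank_scenarios` are hypothetical names and scales, not part of the skill.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scenario:
    """Hypothetical record for one candidate scenario (scores are assumed 1-5)."""
    name: str
    realism: int           # how plausible the scenario is in actual usage
    diagnostic_value: int  # how much a failure would reveal about judgment
    coverage: int          # how many distinct dimensions it exercises


def rank_scenarios(scenarios):
    """Order scenarios by Realism > Diagnostic Value > Coverage.

    Tuples compare element by element, so a higher realism score always
    wins, and the later fields only matter when earlier ones are tied.
    """
    return sorted(
        scenarios,
        key=lambda s: (s.realism, s.diagnostic_value, s.coverage),
        reverse=True,
    )


if __name__ == "__main__":
    candidates = [
        Scenario("broad but contrived", realism=2, diagnostic_value=5, coverage=5),
        Scenario("plausible, narrow", realism=4, diagnostic_value=3, coverage=1),
        Scenario("plausible, revealing", realism=4, diagnostic_value=5, coverage=1),
    ]
    for scenario in rank_scenarios(candidates):
        print(scenario.name)
```

Under this reading, a contrived scenario never outranks a plausible one, however broad its coverage. The ranking only decides the order of presentation. Each scenario is still evaluated by hand, one at a time.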
Goal
Generate scenario-based tests from an agent definition or system prompt, then guide interactive evaluation to identify judgment strengths, weaknesses, and prompt improvement opportunities.
Constraints
Interactive Evaluation Only: This skill guides manual evaluation in-conversation. Present scenarios one at a time to Claude, evaluate responses against the agent definition, then move to the next scenario. Do NOT attempt automated execution or batch processing.
Scenario Realism: Every scenario must be plausible in actual usage. Avoid contrived corner cases that would never occur in practice.