Write LLM-as-Judge Prompt
Design a binary Pass/Fail LLM-as-Judge evaluator for one specific failure mode. Each judge checks exactly one thing.
Prerequisites
- Error analysis is complete. The failure mode is identified.
- You have human-labeled traces for this failure mode (at least 20 Pass and 20 Fail examples).
- A code-based evaluator cannot check this failure mode. Exhaust code-based options before reaching for a judge — many failure modes that seem subjective reduce to keyword checks, regex, or API calls once you understand the domain. Example: detecting whether an AI interviewing coach suggests "general" questions (asking about typical behavior instead of a specific past event) seems to require semantic understanding, but in practice a keyword check for words like "usually," "typical," and "normally" can work well.
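To make the code-based option in the example concrete, here is a minimal sketch of such a keyword check. The marker list and function name are illustrative, not a definitive word list for this failure mode.

```python
import re

# Words that signal a "general" question (asking about typical behavior
# rather than one specific past event). Illustrative list, not exhaustive.
GENERAL_MARKERS = {"usually", "typically", "typical", "normally", "generally"}

def is_general_question(question: str) -> bool:
    """Return True if the question reads as general rather than event-specific."""
    words = set(re.findall(r"[a-z']+", question.lower()))
    return bool(words & GENERAL_MARKERS)
```

A check like this is cheap to run over every trace and easy to debug; only if its precision or recall proves inadequate on your labeled examples does a judge become worth the cost.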
The Four Components
Every judge prompt requires exactly four components:
1. Task and Evaluation Criterion
State what the judge evaluates. One failure mode per judge.
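A hypothetical sketch of how this component might open a judge prompt, reusing the "general question" failure mode from the prerequisites example; the exact wording is illustrative, not a template mandated by this skill.

```python
# Hypothetical first component of a judge prompt: one task, one
# evaluation criterion, binary Pass/Fail output. Wording is illustrative.
TASK_AND_CRITERION = """\
You are evaluating one response from an AI interviewing coach.
Evaluation criterion: the suggested question asks about one specific
past event, not the candidate's typical or general behavior.
Answer with exactly one word: Pass or Fail.
"""
```

Note that the criterion names a single failure mode and forces a binary verdict; anything the judge should also check belongs in a separate judge.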
Related skills