write-judge-prompt


Write LLM-as-Judge Prompt

Design a binary Pass/Fail LLM-as-Judge evaluator for one specific failure mode. Each judge checks exactly one thing.

Prerequisites

  • Error analysis is complete. The failure mode is identified.
  • You have human-labeled traces for this failure mode (at least 20 Pass and 20 Fail examples).
  • A code-based evaluator cannot check this failure mode. Exhaust code-based options before reaching for a judge — many failure modes that seem subjective reduce to keyword checks, regex, or API calls once you understand the domain. Example: detecting whether an AI interviewing coach suggests "general" questions (asking about typical behavior instead of a specific past event) seems to require semantic understanding, but a keyword check for words like "usually," "typical," and "normally" often works well in practice.
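The keyword-check example above can be sketched as a small code-based evaluator. This is a minimal illustration, not part of the skill itself: the marker list and function name are assumptions, and a real check would be tuned against your labeled traces.

```python
import re

# Hypothetical code-based evaluator for the "general questions" failure
# mode: flag coach questions that ask about typical behavior instead of
# a specific past event. The marker list here is illustrative only.
GENERAL_MARKERS = re.compile(
    r"\b(usually|typical(?:ly)?|normally|in general)\b",
    re.IGNORECASE,
)

def is_general_question(question: str) -> bool:
    """Return True if the question uses generalizing language."""
    return bool(GENERAL_MARKERS.search(question))
```

If a check like this reaches acceptable agreement with your human labels, you do not need an LLM judge for this failure mode at all.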

The Four Components

Every judge prompt requires exactly four components:

1. Task and Evaluation Criterion

State what the judge evaluates. One failure mode per judge.
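A minimal sketch of what this first component looks like in a judge prompt, continuing the interviewing-coach example from the prerequisites. The wording, variable names, and output format below are assumptions for illustration, not a template mandated by this skill.

```python
# Hypothetical judge prompt for one failure mode ("general questions").
# Note it states the task and exactly one evaluation criterion, and asks
# for a binary Pass/Fail verdict — nothing else.
JUDGE_PROMPT = """\
You are evaluating a single response from an AI interviewing coach.

Evaluation criterion (check exactly one thing):
Does the coach's suggested question ask about a SPECIFIC past event,
rather than the candidate's general or typical behavior?

Answer with exactly one word: Pass or Fail.

Response to evaluate:
{response}
"""

def build_judge_prompt(response: str) -> str:
    """Fill the judge prompt template with the trace under evaluation."""
    return JUDGE_PROMPT.format(response=response)
```

Keeping the criterion to a single question makes the judge's Pass/Fail labels easy to compare against your human-labeled traces.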

First Seen: Mar 3, 2026