Write LLM-as-Judge Prompt
Design a binary Pass/Fail LLM-as-Judge evaluator for one specific failure mode. Each judge checks exactly one thing.
Prerequisites
- Error analysis is complete. The failure mode is identified.
- You have human-labeled traces for this failure mode (at least 20 Pass and 20 Fail examples).
- A code-based evaluator cannot check this failure mode. Exhaust code-based options before reaching for a judge — many failure modes that seem subjective reduce to keyword checks, regex, or API calls once you understand the domain. Example: detecting whether an AI interviewing coach suggests "general" questions (asking about typical behavior instead of a specific past event) seems to require semantic understanding, but in practice a keyword check for words like "usually," "typical," and "normally" can work well.
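To make the code-based option in the example concrete, here is a minimal sketch of such a keyword check. The marker list and function name are illustrative, not a definitive word list for this failure mode.

```python
import re

# Words that signal a "general" question (asking about typical behavior
# rather than one specific past event). Illustrative list, not exhaustive.
GENERAL_MARKERS = {"usually", "typically", "typical", "normally", "generally"}

def is_general_question(question: str) -> bool:
    """Return True if the question reads as general rather than event-specific."""
    words = set(re.findall(r"[a-z']+", question.lower()))
    return bool(words & GENERAL_MARKERS)
```

A check like this is cheap to run over every trace and easy to debug; only if its precision or recall proves inadequate on your labeled examples does a judge become worth the cost.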
The Four Components
Every judge prompt requires exactly four components:
1. Task and Evaluation Criterion
State what the judge evaluates. One failure mode per judge.
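A hypothetical sketch of how this component might open a judge prompt, reusing the "general question" failure mode from the prerequisites example; the exact wording is illustrative, not a template mandated by this skill.

```python
# Hypothetical first component of a judge prompt: one task, one
# evaluation criterion, binary Pass/Fail output. Wording is illustrative.
TASK_AND_CRITERION = """\
You are evaluating one response from an AI interviewing coach.
Evaluation criterion: the suggested question asks about one specific
past event, not the candidate's typical or general behavior.
Answer with exactly one word: Pass or Fail.
"""
```

Note that the criterion names a single failure mode and forces a binary verdict; anything the judge should also check belongs in a separate judge.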
Related skills