llm-as-a-judge


LLM-as-a-Judge

Build reliable automated evaluators that use an LLM to judge the outputs of another LLM pipeline. Each judge targets a single, binary (Pass/Fail) failure mode identified during error analysis.

When to Use LLM-as-Judge vs. Code

Choose the right evaluator type for each failure mode:

Use code-based evaluators when the failure is objective and deterministic:

  • JSON/SQL syntax validity, regex/string matching, structural constraints, execution errors, logical checks.
  • These are fast, cheap, deterministic, and interpretable.
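As a minimal sketch of what such code-based evaluators can look like, the two checks below each return a binary Pass/Fail result for a single failure mode. The function names and the `ORD-12345` order-ID format are hypothetical, chosen only for illustration.

```python
import json
import re


def is_valid_json(output: str) -> bool:
    """Pass (True) if the pipeline output parses as JSON, Fail (False) otherwise."""
    try:
        json.loads(output)
    except (json.JSONDecodeError, TypeError):
        return False
    return True


# Hypothetical structural constraint: the reply must cite an order ID
# of the form "ORD-" followed by exactly five digits.
ORDER_ID_PATTERN = re.compile(r"\bORD-\d{5}\b")


def mentions_order_id(output: str) -> bool:
    """Pass (True) if the output contains a well-formed order ID."""
    return ORDER_ID_PATTERN.search(output) is not None
```

Each function covers one failure mode, so a failing check points directly at what went wrong.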

Use LLM-as-Judge when the failure requires interpretation or nuance:

  • Tone appropriateness, summary faithfulness, response helpfulness, explanation clarity, creative quality.
  • These require a judge LLM, separate from the one powering the application, to assess outputs.

Each failure mode gets its own dedicated evaluator. Never combine multiple criteria into a single judge prompt: doing so introduces ambiguity and makes diagnosis harder.
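A single-criterion judge could be sketched as follows. This is an illustration under stated assumptions, not a prescribed implementation: the prompt wording, the tone criterion, and the `call_llm` callable (a stand-in for whatever client sends a prompt to the judge model and returns its text) are all hypothetical.

```python
from typing import Callable

# Hypothetical prompt: it names exactly one failure mode and asks for a binary verdict.
JUDGE_PROMPT = """You are evaluating one failure mode only: tone appropriateness.
Fail the response if its tone is rude, dismissive, or overly casual for a
customer-support setting. Otherwise pass it. Ignore every other quality issue.

Response to evaluate:
{output}

Answer with exactly one word: Pass or Fail."""


def parse_verdict(raw: str) -> bool:
    """Map the judge's raw text to True (Pass) or False (Fail)."""
    verdict = raw.strip().rstrip(".").lower()
    if verdict == "pass":
        return True
    if verdict == "fail":
        return False
    # Refuse to guess when the judge did not give a binary answer.
    raise ValueError(f"Unexpected judge verdict: {raw!r}")


def judge_tone(output: str, call_llm: Callable[[str], str]) -> bool:
    """Run the tone judge on one pipeline output using the supplied LLM caller."""
    prompt = JUDGE_PROMPT.format(output=output)
    return parse_verdict(call_llm(prompt))
```

Passing `call_llm` in as an argument keeps the judge independent of any particular provider and lets it be exercised with a stub, for example `judge_tone("Thanks for reaching out!", lambda prompt: "Pass")`. A second failure mode would get its own prompt and its own function rather than an extra clause in this one.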

The Full Workflow

First Seen
Feb 19, 2026