LLM-as-a-Judge
Build reliable automated evaluators that use an LLM to judge the outputs of another LLM pipeline. Each judge targets a single, binary (Pass/Fail) failure mode identified during error analysis.
When to Use LLM-as-Judge vs. Code
Choose the right evaluator type for each failure mode:
Use code-based evaluators when the failure is objective and deterministic:
- JSON/SQL syntax validity, regex/string matching, structural constraints, execution errors, logical checks.
- These are fast, cheap, deterministic, and interpretable; see the sketch after this list.
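For example, a code-based evaluator for a JSON-validity failure mode can be a plain function. The sketch below assumes the pipeline should emit a JSON object with a top-level "answer" key; that requirement is illustrative, not part of this skill.

```python
import json


def judge_json_output(output: str) -> bool:
    """Pass (True) if the output parses as JSON and contains a top-level
    "answer" key; Fail (False) otherwise. The required key is an
    illustrative assumption."""
    try:
        parsed = json.loads(output)
    except json.JSONDecodeError:
        return False  # syntactically invalid JSON
    return isinstance(parsed, dict) and "answer" in parsed


# Run the evaluator over a small batch of pipeline outputs.
outputs = ['{"answer": "42"}', "not json", '{"question": "?"}']
print([judge_json_output(o) for o in outputs])  # [True, False, False]
```

Because the check is pure code, it runs instantly and returns the same verdict every time, which is exactly what you want for objective failure modes.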
Use LLM-as-Judge when the failure requires interpretation or nuance:
- Tone appropriateness, summary faithfulness, response helpfulness, explanation clarity, creative quality.
- These require a separate LLM (distinct from the application) to judge outputs.
Each failure mode gets its own dedicated evaluator. Never combine multiple criteria into a single judge prompt; doing so introduces ambiguity and makes diagnosis harder. A minimal single-criterion judge is sketched below.
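The snippet below is a sketch of one such judge, targeting summary faithfulness with a binary verdict. The OpenAI client, the model name, and the prompt wording are assumptions; substitute whatever judge model and client your evaluation harness already uses, as long as the judge is distinct from the pipeline being evaluated.

```python
from openai import OpenAI  # assumed client; any LLM SDK works the same way

client = OpenAI()

JUDGE_PROMPT = """\
You are evaluating a summary for ONE criterion only: faithfulness.
Pass if every claim in the summary is supported by the source text.
Fail if any claim is unsupported or contradicted by the source text.

Source text:
{source}

Summary:
{summary}

Answer with exactly one word: Pass or Fail."""


def judge_faithfulness(source: str, summary: str) -> bool:
    """LLM-as-Judge evaluator: one failure mode, one binary verdict."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed judge model, distinct from the pipeline under test
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(source=source, summary=summary)}],
        temperature=0,  # keep the judge as deterministic as possible
    )
    verdict = (response.choices[0].message.content or "").strip().lower()
    return verdict.startswith("pass")
```

If a second interpretive criterion matters (say, tone appropriateness), it gets its own prompt and its own function rather than a second clause in this one.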
The Full Workflow