LLM-as-a-Judge
Build reliable automated evaluators that use an LLM to judge the outputs of another LLM pipeline. Each judge targets a single, binary (Pass/Fail) failure mode identified during error analysis.
When to Use LLM-as-Judge vs. Code
Choose the right evaluator type for each failure mode:
Use code-based evaluators when the failure is objective and deterministic:
- JSON/SQL syntax validity, regex/string matching, structural constraints, execution errors, logical checks.
- These are fast, cheap, deterministic, and interpretable; see the sketch after this list.
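For example, a code-based evaluator for a JSON-validity failure mode can be a plain function. The sketch below assumes the pipeline should emit a JSON object with a top-level "answer" key; that requirement is illustrative, not part of this skill.

```python
import json


def judge_json_output(output: str) -> bool:
    """Pass (True) if the output parses as JSON and contains a top-level
    "answer" key; Fail (False) otherwise. The required key is an
    illustrative assumption."""
    try:
        parsed = json.loads(output)
    except json.JSONDecodeError:
        return False  # syntactically invalid JSON
    return isinstance(parsed, dict) and "answer" in parsed


# Run the evaluator over a small batch of pipeline outputs.
outputs = ['{"answer": "42"}', "not json", '{"question": "?"}']
print([judge_json_output(o) for o in outputs])  # [True, False, False]
```

Because the check is pure code, it runs instantly and returns the same verdict every time, which is exactly what you want for objective failure modes.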
Use LLM-as-Judge when the failure requires interpretation or nuance:
- Tone appropriateness, summary faithfulness, response helpfulness, explanation clarity, creative quality.
- These require a separate LLM (distinct from the application) to judge outputs.
Each failure mode gets its own dedicated evaluator. Never combine multiple criteria into a single judge prompt; doing so introduces ambiguity and makes diagnosis harder. A minimal single-criterion judge is sketched below.
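The snippet below is a sketch of one such judge, targeting summary faithfulness with a binary verdict. The OpenAI client, the model name, and the prompt wording are assumptions; substitute whatever judge model and client your evaluation harness already uses, as long as the judge is distinct from the pipeline being evaluated.

```python
from openai import OpenAI  # assumed client; any LLM SDK works the same way

client = OpenAI()

JUDGE_PROMPT = """\
You are evaluating a summary for ONE criterion only: faithfulness.
Pass if every claim in the summary is supported by the source text.
Fail if any claim is unsupported or contradicted by the source text.

Source text:
{source}

Summary:
{summary}

Answer with exactly one word: Pass or Fail."""


def judge_faithfulness(source: str, summary: str) -> bool:
    """LLM-as-Judge evaluator: one failure mode, one binary verdict."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed judge model, distinct from the pipeline under test
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(source=source, summary=summary)}],
        temperature=0,  # keep the judge as deterministic as possible
    )
    verdict = (response.choices[0].message.content or "").strip().lower()
    return verdict.startswith("pass")
```

If a second interpretive criterion matters (say, tone appropriateness), it gets its own prompt and its own function rather than a second clause in this one.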
The Full Workflow