phoenix-evals-new-metric
Creating a New Built-in Classification Evaluator
A built-in evaluator is a YAML config (source of truth) that gets compiled into Python and TypeScript code, wrapped in evaluator classes, benchmarked, and documented. The whole pipeline is linear — follow these steps in order.
Step 0: Gather Requirements
Before writing anything, clarify with the user:
- What does this evaluator measure? Get a one-sentence description of the quality dimension.
- What input data is available? This determines the template placeholders (e.g., {{input}}, {{output}}, {{reference}}, {{tool_definitions}}). If the user is vague, ask follow-up questions — the placeholders are the contract between the evaluator and the caller (see the template sketch after this list).
- What labels make sense? Binary is most common (e.g., correct/incorrect, faithful/unfaithful), but some metrics use more. Labels map to scores.
- Should this appear in the dataset experiments UI? If yes, it needs the promoted_dataset_evaluator label. Currently only correctness, tool_selection, and tool_invocation have this — most new evaluators don't need it.
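For example, with {{input}}, {{output}}, and {{reference}} all available, a binary faithful/unfaithful evaluator might use a template along these lines (an illustrative sketch, not a shipped prompt):

```
Given the question:
{{input}}

and the reference context:
{{reference}}

decide whether the answer below is faithful to the context.
Answer with exactly one label: "faithful" or "unfaithful".

{{output}}
```

Every placeholder that appears in the template becomes a field the caller must supply, which is why pinning them down up front matters.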
Step 1: Create the YAML Config
Create prompts/classification_evaluator_configs/{NAME}_CLASSIFICATION_EVALUATOR_CONFIG.yaml.
Read an existing config to match the current schema. Start with CORRECTNESS_CLASSIFICATION_EVALUATOR_CONFIG.yaml for a simple example, or TOOL_SELECTION_CLASSIFICATION_EVALUATOR_CONFIG.yaml if your evaluator needs structured span data.
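As a rough sketch of the shape such a config might take (every field name below is an assumption; mirror the actual schema you see in CORRECTNESS_CLASSIFICATION_EVALUATOR_CONFIG.yaml, not this example):

```yaml
# Hypothetical structure for illustration only; copy field names from an
# existing config, not from here.
name: faithfulness
description: Whether the output is faithful to the reference context.
labels:
  faithful: 1.0      # labels map to scores
  unfaithful: 0.0
template: |
  Given the question {{input}} and the reference context {{reference}},
  decide whether the answer below is faithful to the context.
  Answer with exactly one label: "faithful" or "unfaithful".

  {{output}}
```

Because the YAML is the source of truth that gets compiled into both Python and TypeScript, getting the schema right here matters more than anywhere else in the pipeline.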