phoenix-evals-new-metric

Creating a New Built-in Classification Evaluator

A built-in evaluator is a YAML config (source of truth) that gets compiled into Python and TypeScript code, wrapped in evaluator classes, benchmarked, and documented. The whole pipeline is linear — follow these steps in order.

Step 0: Gather Requirements

Before writing anything, clarify with the user:

  1. What does this evaluator measure? Get a one-sentence description of the quality dimension.
  2. What input data is available? This determines the template placeholders (e.g., {{input}}, {{output}}, {{reference}}, {{tool_definitions}}). If the user is vague, ask follow-up questions — the placeholders are the contract between the evaluator and the caller.
  3. What labels make sense? Binary is most common (e.g., correct/incorrect, faithful/unfaithful), but some metrics use more. Labels map to scores.
  4. Should this appear in the dataset experiments UI? If yes, it needs the promoted_dataset_evaluator label. Currently only correctness, tool_selection, and tool_invocation have this; most new evaluators don't need it. The sketch after this list shows roughly where each of these answers lands in the config.
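
As a rough illustration of how those answers translate into config fields, a binary evaluator's label-to-score mapping and UI promotion might look something like the fragment below. The key names are assumptions for illustration only; the existing configs are the source of truth.

  choices:                          # classification labels mapped to scores
    faithful: 1.0
    unfaithful: 0.0
  labels:
    - promoted_dataset_evaluator    # only for evaluators surfaced in the experiments UI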

Step 1: Create the YAML Config

Create prompts/classification_evaluator_configs/{NAME}_CLASSIFICATION_EVALUATOR_CONFIG.yaml.

Read an existing config to match the current schema. Start with CORRECTNESS_CLASSIFICATION_EVALUATOR_CONFIG.yaml for a simple example, or TOOL_SELECTION_CLASSIFICATION_EVALUATOR_CONFIG.yaml if your evaluator needs structured span data.
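
To make the shape concrete, here is a minimal sketch of what such a config might contain, using a hypothetical "helpfulness" evaluator. The field names and layout are illustrative assumptions, not the real schema; copy the actual keys from an existing config rather than from this sketch.

  # Illustrative only: a guess at the general shape, not the real schema.
  name: helpfulness
  description: Judges whether the response actually addresses the user's request.
  template: |
    You are evaluating whether the response below addresses the user's request.

    [Input]: {{input}}
    [Response]: {{output}}

    Answer with exactly one label: "helpful" or "unhelpful".
  choices:
    helpful: 1.0      # labels map to scores
    unhelpful: 0.0

Binary choices keep the score mapping simple; evaluators that need more labels just add more entries under the label-to-score mapping.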
