phoenix-evals-new-metric
Creating a New Built-in Classification Evaluator
A built-in evaluator is a YAML config (the source of truth) that gets compiled into Python and TypeScript code, wrapped in evaluator classes, benchmarked, and documented. The pipeline is linear, so follow these steps in order.
Step 0: Gather Requirements
Before writing anything, clarify with the user:
- What does this evaluator measure? Get a one-sentence description of the quality dimension.
- What input data is available? This determines the template placeholders (e.g., {{input}}, {{output}}, {{reference}}, {{tool_definitions}}). If the user is vague, ask follow-up questions: the placeholders are the contract between the evaluator and the caller.
- What labels make sense? Binary is most common (e.g., correct/incorrect, faithful/unfaithful), but some metrics use more. Labels map to scores.
- Should this appear in the dataset experiments UI? If yes, it needs the promoted_dataset_evaluator label. Currently only correctness, tool_selection, and tool_invocation have this; most new evaluators don't need it.
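To make the "placeholders are the contract" idea concrete, here is a minimal Python sketch of checking a caller's data against a template during requirements gathering. The template text, the helper names, and the 1.0/0.0 score values are illustrative assumptions, not the project's actual schema or API.

```python
import re

# Hypothetical template text for illustration; the real template lives in the YAML config.
TEMPLATE = (
    "Question: {{input}}\n"
    "Answer: {{output}}\n"
    "Reference: {{reference}}\n"
)

# Assumed binary label-to-score mapping; the actual labels and scores are
# whatever the evaluator's config defines.
LABEL_SCORES = {"correct": 1.0, "incorrect": 0.0}

# Matches mustache-style placeholders such as {{input}} or {{ input }}.
PLACEHOLDER_RE = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def required_placeholders(template: str) -> set:
    """Return the placeholder names the template expects the caller to supply."""
    return set(PLACEHOLDER_RE.findall(template))


def missing_inputs(template: str, row: dict) -> set:
    """Return placeholder names that the caller's data row does not provide."""
    return required_placeholders(template) - set(row)


if __name__ == "__main__":
    row = {"input": "What is 2 + 2?", "output": "4"}
    print(sorted(required_placeholders(TEMPLATE)))
    print(sorted(missing_inputs(TEMPLATE, row)))
```

Running a check like this against a sample row is a quick way to confirm with the user that the data they have matches the placeholders the evaluator will require.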
Step 1: Create the YAML Config
Create prompts/classification_evaluator_configs/{NAME}_CLASSIFICATION_EVALUATOR_CONFIG.yaml.
Read an existing config to match the current schema. Start with CORRECTNESS_CLASSIFICATION_EVALUATOR_CONFIG.yaml for a simple example, or TOOL_SELECTION_CLASSIFICATION_EVALUATOR_CONFIG.yaml if your evaluator needs structured span data.
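The naming convention above can be sketched as a small Python helper. The function name and the example evaluator name "conciseness" are hypothetical. The upper-casing is inferred from the two existing filenames mentioned above, and the path is treated as relative to wherever the prompts directory lives in the repository.

```python
from pathlib import Path

# Directory taken from the path given in this step (relative location assumed).
CONFIG_DIR = Path("prompts/classification_evaluator_configs")


def config_path(name: str) -> Path:
    """Build the config file path for an evaluator name such as 'tool_selection'."""
    # Normalize to the upper snake case used by the existing config filenames.
    normalized = name.strip().replace("-", "_").replace(" ", "_").upper()
    return CONFIG_DIR / f"{normalized}_CLASSIFICATION_EVALUATOR_CONFIG.yaml"


if __name__ == "__main__":
    # "conciseness" is a made-up evaluator name used only for illustration.
    print(config_path("conciseness"))
```

The helper only computes the filename; the contents of the file should still be modeled on an existing config rather than guessed.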