phoenix-evals-new-metric
Creating a New Built-in Classification Evaluator
A built-in evaluator is a YAML config (source of truth) that gets compiled into Python and TypeScript code, wrapped in evaluator classes, benchmarked, and documented. The whole pipeline is linear — follow these steps in order.
Step 0: Gather Requirements
Before writing anything, clarify with the user:
- What does this evaluator measure? Get a one-sentence description of the quality dimension.
- What input data is available? This determines the template placeholders (e.g., {{input}}, {{output}}, {{reference}}, {{tool_definitions}}). If the user is vague, ask follow-up questions — the placeholders are the contract between the evaluator and the caller (see the template sketch after this list).
- What labels make sense? Binary is most common (e.g., correct/incorrect, faithful/unfaithful), but some metrics use more. Labels map to scores.
- Should this appear in the dataset experiments UI? If yes, it needs the promoted_dataset_evaluator label. Currently only correctness, tool_selection, and tool_invocation have this — most new evaluators don't need it.
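For example, with {{input}}, {{output}}, and {{reference}} all available, a binary faithful/unfaithful evaluator might use a template along these lines (an illustrative sketch, not a shipped prompt):

```
Given the question:
{{input}}

and the reference context:
{{reference}}

decide whether the answer below is faithful to the context.
Answer with exactly one label: "faithful" or "unfaithful".

{{output}}
```

Every placeholder that appears in the template becomes a field the caller must supply, which is why pinning them down up front matters.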
Step 1: Create the YAML Config
Create prompts/classification_evaluator_configs/{NAME}_CLASSIFICATION_EVALUATOR_CONFIG.yaml.
Read an existing config to match the current schema. Start with CORRECTNESS_CLASSIFICATION_EVALUATOR_CONFIG.yaml for a simple example, or TOOL_SELECTION_CLASSIFICATION_EVALUATOR_CONFIG.yaml if your evaluator needs structured span data.
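As a rough sketch of the shape such a config might take (every field name below is an assumption; mirror the actual schema you see in CORRECTNESS_CLASSIFICATION_EVALUATOR_CONFIG.yaml, not this example):

```yaml
# Hypothetical structure for illustration only; copy field names from an
# existing config, not from here.
name: faithfulness
description: Whether the output is faithful to the reference context.
labels:
  faithful: 1.0      # labels map to scores
  unfaithful: 0.0
template: |
  Given the question {{input}} and the reference context {{reference}},
  decide whether the answer below is faithful to the context.
  Answer with exactly one label: "faithful" or "unfaithful".

  {{output}}
```

Because the YAML is the source of truth that gets compiled into both Python and TypeScript, getting the schema right here matters more than anywhere else in the pipeline.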