# Phoenix Evals
Build evaluators for AI/LLM applications: start with code-based checks, use an LLM judge for nuance, and validate against human labels.
## Quick Reference
| Task | Files |
|---|---|
| Setup | setup-python, setup-typescript |
| Decide what to evaluate | evaluators-overview |
| Choose a judge model | fundamentals-model-selection |
| Use pre-built evaluators | evaluators-pre-built |
| Build code evaluator | evaluators-code-python, evaluators-code-typescript |
| Build LLM evaluator | evaluators-llm-python, evaluators-llm-typescript, evaluators-custom-templates |
| Batch evaluate DataFrame | evaluate-dataframe-python |
| Run experiment | experiments-running-python, experiments-running-typescript |
| Create dataset | experiments-datasets-python, experiments-datasets-typescript |
| Generate synthetic data | experiments-synthetic-python, experiments-synthetic-typescript |
| Validate evaluator accuracy | validation, validation-evaluators-python, validation-evaluators-typescript |
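The "code first, validate against humans" workflow above can be sketched in plain Python. This is a hypothetical illustration, not the phoenix-evals API — the function names `contains_citation` and `agreement` are invented for the example:

```python
# Hypothetical sketch: a code evaluator plus a human-agreement check.
# These names are illustrative; they are not part of phoenix-evals.

def contains_citation(output: str) -> str:
    """Code evaluator: pass if the answer cites a source like [1]."""
    return "pass" if "[" in output and "]" in output else "fail"

def agreement(evaluator_labels: list[str], human_labels: list[str]) -> float:
    """Fraction of examples where the evaluator matches the human label."""
    matches = sum(e == h for e, h in zip(evaluator_labels, human_labels))
    return matches / len(human_labels)

outputs = ["See the docs [1].", "No idea.", "Refer to [2]."]
labels = [contains_citation(o) for o in outputs]
human = ["pass", "fail", "fail"]  # a human reviewer disagreed on the last one
score = agreement(labels, human)  # 2 of 3 labels match the human review
```

Cheap deterministic checks like this catch a large share of failures; an LLM judge is reserved for the cases code cannot express, and both are scored against human labels before being trusted.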
## More from arize-ai/phoenix
### phoenix-cli
Debug LLM applications using the Phoenix CLI. Fetch traces, analyze errors, structure trace review with open coding and axial coding, inspect datasets, review experiments, query annotation configs, and use the GraphQL API. Use whenever the user is analyzing traces or spans, investigating LLM/agent failures, deciding what to do after instrumenting an app, building failure taxonomies, choosing what evals to write, or asking "what's going wrong", "what kinds of mistakes", or "where do I focus" — even without naming a technique.
### phoenix-tracing
OpenInference semantic conventions and instrumentation for Phoenix AI observability. Use when implementing LLM tracing, creating custom spans, or deploying to production.
### agent-browser
Browser automation CLI for AI agents. Use when the user needs to interact with websites, including navigating pages, filling forms, clicking buttons, taking screenshots, extracting data, testing web apps, or automating any browser task. Triggers include requests to "open a website", "fill out a form", "click a button", "take a screenshot", "scrape data from a page", "test this web app", "login to a site", "automate browser actions", or any task requiring programmatic web interaction.
### vercel-react-best-practices
React and Next.js performance optimization guidelines from Vercel Engineering. This skill should be used when writing, reviewing, or refactoring React/Next.js code to ensure optimal performance patterns. Triggers on tasks involving React components, Next.js pages, data fetching, bundle optimization, or performance improvements.
### phoenix-skill-development
Develop, refine, and maintain skills in the skills/ directory. Use when creating a new skill, updating an existing skill, adding rule files, or improving skill quality and consistency.
### mintlify

Build and maintain documentation sites with Mintlify.