arize-experiment
Arize Experiment Skill
SPACE— All--spaceflags and theARIZE_SPACEenv var accept a space name (e.g.,my-workspace) or a base64 space ID (e.g.,U3BhY2U6...). Find yours withax spaces list.
Concepts
- Experiment = a named evaluation run against a specific dataset version, containing one run per example
- Experiment Run = the result of processing one dataset example -- includes the model output, optional evaluations, and optional metadata
- Dataset = a versioned collection of examples; every experiment is tied to a dataset and a specific dataset version
- Evaluation = a named metric attached to a run (e.g.,
correctness,relevance), with optional label, score, and explanation
The typical flow: export a dataset → process each example → collect outputs and evaluations → create an experiment with the runs.
Prerequisites
Proceed directly with the task — run the ax command you need. Do NOT check versions, env vars, or profiles upfront.
If an ax command fails, troubleshoot based on the error:
command not foundor version error → see references/ax-setup.md
More from arize-ai/arize-skills
arize-instrumentation
Adds Arize AX tracing to an LLM application for the first time. Follows a two-phase agent-assisted flow to analyze the codebase then implement instrumentation after user confirmation. Use when the user wants to instrument their app, add tracing from scratch, set up LLM observability, integrate OpenTelemetry or openinference, or get started with Arize tracing.
549arize-prompt-optimization
Optimizes, improves, and debugs LLM prompts using production trace data, evaluations, and annotations. Extracts prompts from spans, gathers performance signal, and runs a data-driven optimization loop using the ax CLI. Use when the user mentions optimize prompt, improve prompt, make AI respond better, improve output quality, prompt engineering, prompt tuning, or system prompt improvement.
453arize-trace
Downloads, exports, and inspects existing Arize traces and spans to understand what an LLM app is doing or debug runtime issues. Covers exporting traces by ID, spans by ID, sessions by ID, and root-cause investigation using the ax CLI. Use when the user wants to look at existing trace data, see what their LLM app is doing, export traces, download spans, investigate errors, or analyze behavior regressions.
433arize-dataset
Creates, manages, and queries Arize datasets and examples. Covers dataset CRUD, appending examples, exporting data, and file-based dataset creation using the ax CLI. Use when the user needs test data, evaluation examples, or mentions create dataset, list datasets, export dataset, append examples, dataset version, golden dataset, or test set.
409arize-link
Generates deep links to the Arize UI for traces, spans, sessions, datasets, labeling queues, evaluators, and annotation configs. Produces clickable URLs for sharing Arize resources with team members. Use when the user wants to link to or open a trace, span, session, dataset, evaluator, or annotation config in the Arize UI.
403arize-evaluator
Handles LLM-as-judge evaluation workflows on Arize including creating/updating evaluators, running evaluations on spans or experiments, managing tasks, trigger-run operations, column mapping, and continuous monitoring. Use when the user mentions create evaluator, LLM judge, hallucination, faithfulness, correctness, relevance, run eval, score spans, score experiment, trigger-run, column mapping, continuous monitoring, or improve evaluator prompt.
353