Agent Evaluation with MLflow

A comprehensive guide to evaluating GenAI agents with MLflow. Use this skill for the complete evaluation workflow or for individual components: tracing setup, environment configuration, dataset creation, scorer definition, or evaluation execution. Each section can be used independently.

⛔ CRITICAL: Must Use MLflow APIs

DO NOT create custom evaluation frameworks. You MUST use MLflow's native APIs:

Datasets: Use mlflow.genai.datasets.create_dataset() - NOT custom test case files
Scorers: Use mlflow.genai.scorers and mlflow.genai.judges.make_judge() - NOT custom scorer functions
Evaluation: Use mlflow.genai.evaluate() - NOT custom evaluation loops
Scripts: Use the provided scripts/ directory templates - NOT custom evaluation/ directories

Why? MLflow records datasets, scorers, traces, and results in the experiment. A custom framework bypasses that tracking, so its runs are not observable from MLflow.

If you're tempted to create evaluation/eval_dataset.py or similar custom files, STOP. Use scripts/create_dataset_template.py instead.
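
As a rough illustration of how the three APIs named above fit together, here is a minimal sketch. It is not the contents of the provided scripts/ templates. It assumes MLflow 3.4 or later (where `create_dataset` accepts `name` and `experiment_id`), and the helper names `build_records` and `run_evaluation`, the dataset name, the judge name and instructions, and the `judge_model` value are all hypothetical placeholders.

```python
from __future__ import annotations

from typing import Any, Callable


def build_records(cases: list[tuple[str, str]]) -> list[dict[str, Any]]:
    """Turn (question, expected_answer) pairs into evaluation records.

    Each record uses the "inputs" / "expectations" layout that MLflow
    GenAI evaluation datasets accept.
    """
    return [
        {"inputs": {"question": q}, "expectations": {"expected_response": a}}
        for q, a in cases
    ]


def run_evaluation(
    agent_fn: Callable[..., Any],
    records: list[dict[str, Any]],
    experiment_id: str,
    judge_model: str,
):
    """Register a dataset, define a judge, and run mlflow.genai.evaluate().

    agent_fn is called with each record's "inputs" as keyword arguments,
    e.g. agent_fn(question="...").
    """
    # Imported lazily so build_records() can be used without MLflow installed.
    import mlflow
    from mlflow.genai.datasets import create_dataset
    from mlflow.genai.judges import make_judge

    # Dataset: stored in the MLflow experiment rather than a custom file.
    # "agent_eval_smoke" is a placeholder name.
    dataset = create_dataset(name="agent_eval_smoke", experiment_id=experiment_id)
    dataset.merge_records(records)

    # Scorer: an LLM judge defined through make_judge(). The instructions
    # are illustrative; {{ inputs }} and {{ outputs }} are template variables.
    judge = make_judge(
        name="answers_question",
        instructions=(
            "Does the response in {{ outputs }} correctly answer the "
            "question in {{ inputs }}? Answer 'yes' or 'no'."
        ),
        model=judge_model,
    )

    # Evaluation: MLflow runs the agent on each record and applies the judge.
    return mlflow.genai.evaluate(
        data=dataset, predict_fn=agent_fn, scorers=[judge]
    )
```

Calling `run_evaluation` requires a configured tracking server, an existing experiment, and credentials for whichever judge model is chosen, so treat it as a starting point to adapt rather than a drop-in script.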
