Agent Evaluation with MLflow

Comprehensive guide to evaluating GenAI agents with MLflow. Use this skill for the complete evaluation workflow or for individual components: tracing setup, environment configuration, dataset creation, scorer definition, or evaluation execution. Each section can be used independently.

⛔ CRITICAL: Must Use MLflow APIs

DO NOT create custom evaluation frameworks. You MUST use MLflow's native APIs:

  • Datasets: Use mlflow.genai.datasets.create_dataset() - NOT custom test case files
  • Scorers: Use mlflow.genai.scorers and mlflow.genai.judges.make_judge() - NOT custom scorer functions
  • Evaluation: Use mlflow.genai.evaluate() - NOT custom evaluation loops
  • Scripts: Use the provided scripts/ directory templates - NOT custom evaluation/ directories
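The APIs above fit together as follows. This is a minimal sketch, assuming MLflow 3.x with the `genai` module; `predict_fn` is an illustrative stub standing in for a real agent, and the record contents are made up:

```python
# Sketch of the MLflow-native evaluation loop (assumes MLflow 3.x).
# `predict_fn` is a stub; replace it with a call into your actual agent.

def predict_fn(inputs: dict) -> str:
    """Stand-in agent: echoes the question instead of calling a real model."""
    return f"Echo: {inputs['question']}"

eval_data = [
    {
        "inputs": {"question": "What is MLflow?"},
        "expectations": {"expected_facts": ["MLflow is an open-source MLOps platform"]},
    },
]

def run_eval():
    """Run the evaluation; requires a tracking server and judge-model credentials."""
    import mlflow
    from mlflow.genai.scorers import Correctness  # built-in LLM-judge scorer

    return mlflow.genai.evaluate(
        data=eval_data,
        predict_fn=predict_fn,
        scorers=[Correctness()],
    )

# In a configured environment:
#   results = run_eval()
#   print(results.metrics)
```

Because `mlflow.genai.evaluate()` wraps the loop, every trace, scorer verdict, and metric lands in the experiment automatically; a hand-rolled loop would have to re-implement all of that logging.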

Why? MLflow tracks everything (datasets, scorers, traces, results) in the experiment. Custom frameworks bypass this and lose all observability.

If you're tempted to create evaluation/eval_dataset.py or similar custom files, STOP. Use scripts/create_dataset_template.py instead.
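For comparison, a dataset registered through the native API (as the template does) looks roughly like this sketch. The dataset name and records are illustrative assumptions, and `create_dataset`'s parameters can vary by MLflow version, so treat this as a shape, not a drop-in:

```python
# Hedged sketch: registering evaluation records via MLflow's dataset API
# instead of a custom eval_dataset.py file. Names/records are illustrative.

records = [
    {
        "inputs": {"question": "What does the agent do with ambiguous queries?"},
        "expectations": {"expected_facts": ["asks a clarifying question"]},
    },
]

def register_dataset(name: str = "agent-eval-dataset"):
    """Create the dataset in the MLflow experiment and attach the records."""
    from mlflow.genai.datasets import create_dataset

    dataset = create_dataset(name=name)  # tracked alongside scorers and traces
    dataset.merge_records(records)
    return dataset

# In a configured environment:
#   dataset = register_dataset()
```

The point of going through `create_dataset()` is that the dataset becomes a versioned, queryable artifact of the experiment, which a loose Python file of test cases never is.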

Repository: mlflow/skills
First seen: Feb 4, 2026