MLflow GenAI Evaluation Patterns

Production-grade patterns for evaluating Databricks GenAI agents with mlflow.genai.evaluate() (MLflow 3.0+), using LLM-as-judge scorers and custom evaluation metrics.

When to Use

Implementing agent evaluation pipelines with LLM judges
Creating custom domain-specific evaluation scorers
Setting up evaluation datasets for agent testing
Checking deployment thresholds before promoting an agent to production
Troubleshooting evaluation errors (0.0 scores, metric name mismatches)
Optimizing guidelines for better evaluation scores
Querying evaluation results programmatically
Aligning LLM judges with domain expert feedback via MemAlign
Automating prompt optimization with GEPA (optimize_prompts())
Setting up Unity Catalog trace ingestion for production monitoring
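
As an illustration of the threshold check and the metric name mismatch problem listed above, here is a minimal sketch of a deployment gate in plain Python. It assumes the aggregated evaluation metrics are already available as a dict of metric name to score. The function name, the GateResult type, and the metric names used in the usage note are hypothetical and do not come from MLflow or from this skill. The one deliberate design choice is that a metric missing from the results is reported as a failure rather than silently treated as 0.0, which may make a name mismatch easier to tell apart from a genuinely low score.

```python
from dataclasses import dataclass, field


@dataclass
class GateResult:
    """Outcome of a threshold check (hypothetical helper type)."""
    passed: bool
    failures: list = field(default_factory=list)


def check_deployment_thresholds(metrics, thresholds):
    """Compare aggregated metrics against minimum required scores.

    metrics:    dict of metric name -> observed score (assumed input shape)
    thresholds: dict of metric name -> minimum acceptable score
    """
    failures = []
    for name, minimum in thresholds.items():
        # A missing key usually means the threshold name does not match
        # the metric name produced by the evaluation run.
        if name not in metrics:
            failures.append(f"{name}: missing from evaluation metrics")
            continue
        value = metrics[name]
        if value < minimum:
            failures.append(
                f"{name}: {value:.2f} is below the required {minimum:.2f}"
            )
    return GateResult(passed=not failures, failures=failures)
```

For example, calling check_deployment_thresholds({"relevance/mean": 0.9}, {"relevance": 0.8}) fails with a "missing" message, because the threshold key does not match the metric key exactly, even though the score itself would have cleared the bar.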

Repository

databricks-solu…template

First Seen

Mar 8, 2026

Security Audits

Gen Agent Trust Hub: Pass

Socket: Pass

Snyk: Pass