mlflow-genai-evaluation
MLflow GenAI Evaluation Patterns
Production-grade patterns for evaluating Databricks GenAI agents with mlflow.genai.evaluate() (MLflow 3.0+), using LLM-as-judge scorers and custom evaluation metrics.
When to Use
- Implementing agent evaluation pipelines with LLM judges
- Creating custom domain-specific evaluation scorers
- Setting up evaluation datasets for agent testing
- Checking deployment thresholds before releasing to production
- Troubleshooting evaluation errors (0.0 scores, metric name mismatches)
- Optimizing guidelines for better evaluation scores
- Querying evaluation results programmatically
- Aligning LLM judges with domain expert feedback via MemAlign
- Automated prompt optimization with GEPA (optimize_prompts())
- Setting up Unity Catalog trace ingestion for production monitoring
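As an illustration of the deployment-threshold and metric-name-mismatch items above, here is a minimal sketch of a threshold gate in plain Python. It assumes the evaluation run has already produced a flat dictionary of aggregate metrics. The helper name check_thresholds and the metric names (such as "correctness/mean") are hypothetical and chosen for illustration only; verify the names your own evaluation run actually emits.

```python
from typing import Dict, List


def check_thresholds(
    metrics: Dict[str, float], thresholds: Dict[str, float]
) -> List[str]:
    """Return human-readable failures; an empty list means the gate passes.

    Hypothetical helper: `metrics` is assumed to be a flat mapping of
    aggregate metric names to scores from an evaluation run.
    """
    failures = []
    for name, minimum in thresholds.items():
        if name not in metrics:
            # A missing key usually signals a metric name mismatch, which
            # should block deployment rather than pass silently.
            failures.append(f"{name}: missing from evaluation results")
        elif metrics[name] < minimum:
            failures.append(
                f"{name}: {metrics[name]:.2f} < required {minimum:.2f}"
            )
    return failures


if __name__ == "__main__":
    # Illustrative values only, not real evaluation output.
    example_metrics = {"correctness/mean": 0.82, "safety/mean": 1.0}
    example_thresholds = {"correctness/mean": 0.80, "relevance/mean": 0.70}
    for failure in check_thresholds(example_metrics, example_thresholds):
        print("BLOCKED:", failure)
```

Treating a missing metric as a failure, rather than skipping it, is a deliberate choice here: a renamed or misspelled metric then surfaces at the gate instead of letting an unevaluated agent through.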