mlflow-genai-evaluation

MLflow GenAI Evaluation Patterns

Production-grade patterns for evaluating Databricks GenAI agents with mlflow.genai.evaluate() (MLflow 3.0+), using LLM-as-judge scorers and custom evaluation metrics.

When to Use

  • Implementing agent evaluation pipelines with LLM judges
  • Creating custom domain-specific evaluation scorers
  • Setting up evaluation datasets for agent testing
  • Checking deployment thresholds before promoting an agent to production
  • Troubleshooting evaluation errors (0.0 scores, metric name mismatches)
  • Optimizing guidelines for better evaluation scores
  • Querying evaluation results programmatically
  • Aligning LLM judges with domain expert feedback via MemAlign
  • Automating prompt optimization with GEPA (optimize_prompts())
  • Setting up Unity Catalog trace ingestion for production monitoring
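
To make the threshold-check and metric-name-mismatch items above more concrete, here is a minimal sketch of a deployment gate in plain Python. It is not the skill's own implementation: the function name check_deployment_thresholds, the metric names, and the threshold values are all hypothetical, and the metric names produced by mlflow.genai.evaluate() in your setup may differ. The sketch assumes you have already collected aggregated scores into a dictionary.

```python
from typing import Dict, List


def check_deployment_thresholds(
    metrics: Dict[str, float],
    thresholds: Dict[str, float],
) -> List[str]:
    """Return a list of failure messages; an empty list means the gate passes.

    A metric that is required but absent from the results is reported as a
    failure instead of being skipped, so a metric name mismatch cannot let
    a deployment through unnoticed.
    """
    failures = []
    for name, minimum in thresholds.items():
        if name not in metrics:
            failures.append(
                f"{name}: missing from results (available: {sorted(metrics)})"
            )
        elif metrics[name] < minimum:
            failures.append(
                f"{name}: {metrics[name]:.2f} < required {minimum:.2f}"
            )
    return failures


# Hypothetical aggregated scores and thresholds, for illustration only.
example_metrics = {
    "relevance/mean": 0.91,
    "safety/mean": 1.0,
    "guidelines/mean": 0.62,
}
example_thresholds = {
    "relevance/mean": 0.80,
    "safety/mean": 0.95,
    "guidelines/mean": 0.70,
    "correctness/mean": 0.75,  # deliberately absent from example_metrics
}

failures = check_deployment_thresholds(example_metrics, example_thresholds)
for message in failures:
    print("FAILED", message)
```

Treating a missing metric as a failure is a design choice in this sketch: it surfaces name mismatches at gate time, where a lookup that defaults to 0.0 or skips the key would hide them.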

Installs: 1
GitHub Stars: 2
First Seen: Mar 8, 2026
mlflow-genai-evaluation — databricks-solutions/vibe-coding-workshop-template