dspy-evaluate

Evaluate Your DSPy Program

Guide the user through measuring AI quality with DSPy's Evaluate class. The pattern: pick a metric, prepare a devset, run the evaluator, interpret results, then feed the same metric into an optimizer.

What is dspy.Evaluate

dspy.Evaluate runs your program on every example in the devset, scores each output with a metric, and reports the aggregate score as a percentage (0-100). It handles multithreading and progress display for you.

Built-in metrics

DSPy provides answer_exact_match (normalized string equality between the predicted and gold answer) and answer_passage_match (checks whether the gold answer appears in the prediction's retrieved passages). Both expect an answer field on the example; answer_exact_match also reads pred.answer, while answer_passage_match reads pred.context.

SemanticF1

Scores the predicted answer against the expected one as an F1 over their key ideas: an LM judge estimates how much of the gold answer the prediction recalls and how much of the prediction is relevant (precision), then combines the two. More forgiving than exact match, since answers that are close but not identical earn partial credit:

from dspy.evaluate import SemanticF1