Set Up Evaluations for Your Agent

LangWatch Evaluations covers four areas: experiments, online evaluation (including guardrails), evaluators, and datasets. Map the user's request to one branch:

User says...	They need...	Go to...
"test my agent", "benchmark", "compare models"	Experiments	Step A
"monitor production", "track quality", "block harmful content", "safety"	Online Evaluation (includes guardrails)	Step B
"create an evaluator", "scoring function"	Evaluators	Step C
"create a dataset", "test data"	Datasets	Step D
"evaluate" (ambiguous)	Ask: "batch test or production monitoring?"	-
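
The routing table above can be read as a simple phrase lookup. The following is a minimal sketch in plain Python, not part of LangWatch: `ROUTES`, `CLARIFYING_QUESTION`, and `route_request` are hypothetical names, and falling back to the clarifying question for any unmatched request is an assumption that extends the "evaluate" (ambiguous) row.

```python
from typing import List, Tuple

# Hypothetical lookup table: trigger phrases copied from the routing table above.
ROUTES: List[Tuple[Tuple[str, ...], str]] = [
    (("test my agent", "benchmark", "compare models"), "Step A: Experiments"),
    (("monitor production", "track quality", "block harmful content", "safety"),
     "Step B: Online Evaluation"),
    (("create an evaluator", "scoring function"), "Step C: Evaluators"),
    (("create a dataset", "test data"), "Step D: Datasets"),
]

CLARIFYING_QUESTION = "batch test or production monitoring?"


def route_request(request: str) -> str:
    """Return the branch whose trigger phrase appears in the request.

    Assumption: any request that matches no phrase (including a bare
    "evaluate") is treated as ambiguous and answered with the question.
    """
    text = request.lower()
    for phrases, branch in ROUTES:
        if any(phrase in text for phrase in phrases):
            return branch
    return f"Ask: {CLARIFYING_QUESTION}"
```

For example, `route_request("Can you benchmark my agent?")` selects the Experiments branch, while `route_request("evaluate my agent")` returns the clarifying question. Real requests are rarely this literal, so treat the phrases as hints rather than exact triggers.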

Where Evaluations Fit

Evaluations sit at the component level of the testing pyramid: they test specific aspects of an agent against many input/output examples. Scenarios, by contrast, test end-to-end multi-turn behavior.

Use evaluations when you have many examples with clear correct answers, or for CI quality gates. Use scenarios for multi-turn behavior and tool-calling sequences.
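
To make "many examples with clear correct answers" and "CI quality gate" concrete, here is a minimal sketch in plain Python. It does not use the LangWatch SDK. `toy_agent`, `EXAMPLES`, `pass_rate`, and `quality_gate` are hypothetical names, and exact string match stands in for whatever evaluator you would actually choose.

```python
from typing import Callable, List, Tuple

# Hypothetical stand-in for the agent under test.
def toy_agent(question: str) -> str:
    answers = {"2 + 2": "4", "capital of France": "Paris"}
    return answers.get(question, "unknown")

# Hypothetical dataset: (input, expected output) pairs.
EXAMPLES: List[Tuple[str, str]] = [
    ("2 + 2", "4"),
    ("capital of France", "Paris"),
    ("capital of Spain", "Madrid"),
]


def pass_rate(agent: Callable[[str], str],
              examples: List[Tuple[str, str]]) -> float:
    """Fraction of examples where the agent's output exactly matches."""
    if not examples:
        raise ValueError("examples must not be empty")
    passed = sum(1 for inp, expected in examples
                 if agent(inp).strip() == expected)
    return passed / len(examples)


def quality_gate(rate: float, threshold: float) -> bool:
    """CI-style gate: pass only if the rate meets the threshold."""
    return rate >= threshold
```

In a CI job, a failing gate would typically exit non-zero, for example `sys.exit(0 if quality_gate(pass_rate(toy_agent, EXAMPLES), 0.8) else 1)`. The 0.8 threshold is illustrative only.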

Determine Scope