Set Up Evaluations for Your Agent
LangWatch Evaluations is a comprehensive QA system. Map the user's request to one branch:
| User says... | They need... | Go to... |
|---|---|---|
| "test my agent", "benchmark", "compare models" | Experiments | Step A |
| "monitor production", "track quality", "block harmful content", "safety" | Online Evaluation (includes guardrails) | Step B |
| "create an evaluator", "scoring function" | Evaluators | Step C |
| "create a dataset", "test data" | Datasets | Step D |
| "evaluate" (ambiguous) | Ask: "batch test or production monitoring?" | - |
Where Evaluations Fit
Evaluations sit at the component level of the testing pyramid: they test specific aspects of an agent against many input/output examples. Scenarios are different: they test end-to-end, multi-turn behavior.
Use evaluations when you have many examples with clear correct answers, or for CI quality gates. Use scenarios for multi-turn behavior and tool-calling sequences.
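To make the "many examples with clear correct answers" and "CI quality gate" ideas concrete, here is a minimal sketch in plain Python. It does not use the LangWatch SDK. The dataset, `toy_agent`, `exact_match`, `pass_rate`, `quality_gate`, and the 0.9 threshold are all hypothetical names and values chosen for illustration.

```python
from typing import Callable, Dict, List

# Hypothetical dataset: each example pairs an input with a known-correct answer.
DATASET: List[Dict[str, str]] = [
    {"input": "2 + 2", "expected": "4"},
    {"input": "capital of France", "expected": "Paris"},
    {"input": "3 * 3", "expected": "9"},
]


def toy_agent(prompt: str) -> str:
    """Stand-in for a real agent call (assumption); it gets one answer wrong."""
    answers = {"2 + 2": "4", "capital of France": "paris", "3 * 3": "6"}
    return answers.get(prompt, "")


def exact_match(output: str, expected: str) -> bool:
    """Simple evaluator: case-insensitive, whitespace-insensitive equality."""
    return output.strip().lower() == expected.strip().lower()


def pass_rate(agent: Callable[[str], str], dataset: List[Dict[str, str]]) -> float:
    """Run the agent on every example and return the fraction that pass."""
    passed = sum(exact_match(agent(ex["input"]), ex["expected"]) for ex in dataset)
    return passed / len(dataset)


def quality_gate(rate: float, threshold: float) -> bool:
    """Return True when the pass rate meets the (illustrative) threshold."""
    return rate >= threshold


rate = pass_rate(toy_agent, DATASET)
gate_ok = quality_gate(rate, threshold=0.9)
print(f"pass rate: {rate:.2f}, gate passed: {gate_ok}")
```

In a CI job, the script would typically exit with a non-zero status when `gate_ok` is false so that the pipeline fails. A multi-turn conversation or a tool-calling sequence does not reduce to a single expected string like this, which is why those cases fit scenarios better.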