evaluations


Set Up Evaluations for Your Agent

LangWatch Evaluations covers four QA workflows: experiments, online evaluation (including guardrails), evaluators, and datasets. Map the user's request to one branch:

| User says... | They need... | Go to... |
| --- | --- | --- |
| "test my agent", "benchmark", "compare models" | Experiments | Step A |
| "monitor production", "track quality", "block harmful content", "safety" | Online Evaluation (includes guardrails) | Step B |
| "create an evaluator", "scoring function" | Evaluators | Step C |
| "create a dataset", "test data" | Datasets | Step D |
| "evaluate" (ambiguous) | Ask: "batch test or production monitoring?" | - |

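The mapping above can be sketched as a small lookup. This is an illustrative sketch only, not part of LangWatch: `ROUTES`, `CLARIFY`, and `route_request` are hypothetical names, matching is a plain substring check, and falling back to the clarifying question for any unmatched request is an assumption that goes slightly beyond the "evaluate" row.

```python
# Hypothetical routing helper: phrases and step labels are copied from the
# mapping above; the names and the matching strategy are assumptions.
ROUTES = [
    (("test my agent", "benchmark", "compare models"), "Step A: Experiments"),
    (("monitor production", "track quality", "block harmful content", "safety"),
     "Step B: Online Evaluation"),
    (("create an evaluator", "scoring function"), "Step C: Evaluators"),
    (("create a dataset", "test data"), "Step D: Datasets"),
]

# Assumption: anything that matches no phrase is treated as ambiguous.
CLARIFY = 'Ask: "batch test or production monitoring?"'


def route_request(request: str) -> str:
    """Return the branch for a user request, or the clarifying question."""
    text = request.lower()
    for phrases, step in ROUTES:
        if any(phrase in text for phrase in phrases):
            return step
    return CLARIFY


if __name__ == "__main__":
    for example in ("Can you benchmark my bot?", "please evaluate this"):
        print(f"{example!r} -> {route_request(example)}")
```

A bare "evaluate" request matches none of the phrases, so it falls through to the clarifying question, which mirrors the last row of the mapping.
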
Where Evaluations Fit

Evaluations sit at the component level of the testing pyramid: they test specific aspects of an agent against many input/output examples. Scenarios, by contrast, test end-to-end multi-turn behavior.

Use evaluations when you have many examples with clear correct answers, or for CI quality gates. Use scenarios for multi-turn behavior and tool-calling sequences.
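
As a rough illustration of "many examples with clear correct answers" feeding a CI quality gate, here is a minimal sketch in plain Python. It does not use the LangWatch SDK: `toy_agent`, `EXAMPLES`, `exact_match_accuracy`, `passes_quality_gate`, and the threshold values are all hypothetical, and exact match is just one possible scoring rule.

```python
from typing import Callable, Sequence, Tuple

Example = Tuple[str, str]  # (input, expected output)


def exact_match_accuracy(agent: Callable[[str], str],
                         examples: Sequence[Example]) -> float:
    """Fraction of examples where the agent output equals the expected answer
    (case-insensitive, surrounding whitespace ignored)."""
    if not examples:
        raise ValueError("examples must not be empty")
    correct = sum(
        1 for question, expected in examples
        if agent(question).strip().lower() == expected.strip().lower()
    )
    return correct / len(examples)


def passes_quality_gate(accuracy: float, threshold: float) -> bool:
    """A CI job could fail the build when this returns False."""
    return accuracy >= threshold


def toy_agent(question: str) -> str:
    # Hypothetical stand-in for a real agent call.
    answers = {"2 + 2": "4", "capital of France": "Paris"}
    return answers.get(question, "unknown")


# Hypothetical dataset: each input has one clear correct answer.
EXAMPLES = [
    ("2 + 2", "4"),
    ("capital of France", "paris"),
    ("largest planet", "Jupiter"),
    ("3 + 1", "4"),
]

if __name__ == "__main__":
    accuracy = exact_match_accuracy(toy_agent, EXAMPLES)
    print(f"accuracy = {accuracy:.2f}")
    print("gate passed" if passes_quality_gate(accuracy, 0.8) else "gate failed")
```

Multi-turn behavior and tool-calling sequences do not reduce to a single expected string per input, which is why the text points those cases to scenarios instead.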

Determine Scope

evaluations — langwatch/skills