Evaluate RAG

Overview

Do error analysis on end-to-end traces first. Determine whether failures come from retrieval, generation, or both.
Build a retrieval evaluation dataset: queries paired with relevant document chunks.
Measure retrieval quality with Recall@k, the fraction of relevant chunks that appear in the top k results (most important for first-pass retrieval).
Evaluate generation separately: faithfulness (grounded in context?) and relevance (answers the query?).
If retrieval is the bottleneck, optimize chunking via grid search before tuning generation.
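
As a rough illustration of the retrieval dataset and Recall@k steps above, here is a minimal sketch. The chunk IDs, queries, and the function names `recall_at_k` and `mean_recall_at_k` are hypothetical; it assumes each query is labeled with a set of relevant chunk IDs and that the retriever returns a ranked list of IDs.

```python
from typing import Dict, List, Set


def recall_at_k(retrieved: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant chunk IDs that appear in the top-k results."""
    if not relevant:
        raise ValueError("each query needs at least one relevant chunk")
    top_k = set(retrieved[:k])
    return len(top_k & relevant) / len(relevant)


def mean_recall_at_k(
    dataset: Dict[str, Set[str]], run: Dict[str, List[str]], k: int
) -> float:
    """Average Recall@k over every query in the evaluation dataset."""
    scores = [recall_at_k(run[query], rel, k) for query, rel in dataset.items()]
    return sum(scores) / len(scores)


# Hypothetical evaluation set: query -> IDs of the chunks that answer it.
eval_set = {
    "how do I reset my password?": {"doc1#2", "doc1#3"},
    "what is the refund window?": {"doc7#0"},
}

# Hypothetical retriever output: query -> ranked chunk IDs.
retriever_output = {
    "how do I reset my password?": ["doc1#2", "doc4#1", "doc1#3", "doc9#0"],
    "what is the refund window?": ["doc2#5", "doc3#1", "doc7#0"],
}

for k in (2, 3):
    print(f"Recall@{k}: {mean_recall_at_k(eval_set, retriever_output, k):.2f}")
```

Running the same scoring function over each candidate chunking configuration is one simple way to drive the grid search mentioned above.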

Prerequisites

Complete error analysis on RAG pipeline traces before selecting metrics. Inspect what was retrieved vs. what the model needed. Determine whether the problem is retrieval, generation, or both. Fix retrieval first.
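
One possible way to tally the outcome of that inspection is sketched below. The `Trace` fields and the `classify` rule are assumptions for illustration, not a prescribed schema: a reviewer records which chunks the answer needed and whether the final answer was acceptable. A failed trace with missing chunks is counted as a retrieval failure, which may also hide a generation problem; that is consistent with fixing retrieval first.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List, Set


@dataclass
class Trace:
    query: str
    needed_chunks: Set[str]      # chunks a reviewer judged necessary to answer
    retrieved_chunks: List[str]  # chunks the pipeline actually retrieved
    answer_ok: bool              # reviewer's verdict on the final answer


def classify(trace: Trace) -> str:
    """Label a trace as 'ok', 'retrieval', or 'generation'."""
    if trace.answer_ok:
        return "ok"
    missing = trace.needed_chunks - set(trace.retrieved_chunks)
    # Missing context points at retrieval; otherwise the model had what it needed.
    return "retrieval" if missing else "generation"


# Hypothetical reviewed traces.
traces = [
    Trace("q1", {"a"}, ["a", "b"], answer_ok=True),
    Trace("q2", {"c"}, ["a", "b"], answer_ok=False),
    Trace("q3", {"d"}, ["d", "e"], answer_ok=False),
    Trace("q4", {"f", "g"}, ["f"], answer_ok=False),
]

counts = Counter(classify(t) for t in traces)
print(counts)
```

If retrieval failures dominate the tally, that supports working on retrieval before touching prompts or the generation model.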

Core Instructions

Evaluate Retrieval and Generation Separately

Measure each component independently. Use the appropriate metric for each retrieval stage:
