On-disk RAG evaluation (`corpus/` + `train.json`)

Purpose

Guide agents through NVIDIA RAG Blueprint filesystem benchmarks: preparing corpus/ and train.json, running scripts/eval/evaluate_rag.py, tuning retrieval and generation flags for quality comparisons, interpreting RAGAS JSON outputs, and triaging failures (HTTP/stream errors, empty contexts, collection mismatch, judge API).

For latency, throughput, and load testing, use the rag-perf skill (scripts/rag-perf, docs/performance-benchmarking.md) — not this skill.

When not to use

Do not use this skill for: deploying or repairing services (use rag-blueprint); evaluating APIs without the corpus/ + train.json layout; general ML experimentation unrelated to this evaluator; production monitoring/alerting; or latency/throughput benchmarking (use rag-perf).

Prerequisites

Repo cloned; run commands from repo root (imports and paths assume this).
Python 3.11+ and uv; eval deps: uv sync --project scripts/eval.
Reachable RAG server and ingestor (defaults often localhost:8081 / 8082).
NVIDIA_API_KEY for RAGAS (see credential hygiene); optional RAG_EVAL_JUDGE_MODEL.
Dataset roots passed to --dataset-paths each contain corpus/ and train.json.

rag-eval

On-disk RAG evaluation (corpus/ + train.json)

Purpose

When not to use

Prerequisites

On-disk RAG evaluation (`corpus/` + `train.json`)