AI Eval in CI
Overview
Test AI agents and LLM outputs the same way you test code: automated evaluations that run in CI, compare against baselines, and fail the build when quality drops. No dashboards to check manually, just `npx eval run --ci` and a red or green build.
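As a sketch of the CI wiring, here is a minimal GitHub Actions workflow that runs the eval command above on every pull request; the workflow name, Node version, and secret name are illustrative, and the build fails whenever the eval command exits non-zero:

```yaml
# .github/workflows/ai-evals.yml (illustrative names throughout)
name: ai-evals
on: [pull_request]

jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
      # A non-zero exit code from the eval run fails this step,
      # which fails the build and blocks the merge.
      - name: Run evals
        run: npx eval run --ci
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
```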
When to Use
- Adding quality gates before deploying AI features to production
- Catching prompt regressions when system prompts or models change
- Comparing model performance (GPT-4o vs Claude Sonnet vs local Llama)
- Validating RAG pipeline accuracy against a test dataset
- Benchmarking agent tool-calling accuracy and latency
Instructions
Strategy 1: Promptfoo (Config-Driven Evals)
Promptfoo is a widely used open-source eval framework: define test cases in YAML, run them against multiple providers, and get a side-by-side comparison matrix.
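A minimal `promptfooconfig.yaml` sketch; the prompt, model identifiers, and test content are illustrative, not a prescription:

```yaml
# promptfooconfig.yaml (illustrative example)
description: "Smoke tests for a support-bot prompt"

prompts:
  - "You are a helpful support agent. Answer the user's question: {{question}}"

providers:
  - openai:gpt-4o-mini
  - anthropic:messages:claude-3-5-sonnet-20241022

tests:
  - vars:
      question: "How do I reset my password?"
    assert:
      # Deterministic check: the answer must mention the reset flow.
      - type: contains
        value: "reset"
      # Model-graded check: an LLM judges the answer against a rubric.
      - type: llm-rubric
        value: "Politely explains the password reset steps"
```

Run it with `npx promptfoo eval`; each test case is executed against every provider, producing the comparison matrix. Promptfoo exits non-zero when assertions fail, so the same command doubles as a CI gate.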