Eval-Driven Development for Python LLM Applications
You're building an automated evaluation pipeline that tests a Python-based AI application end-to-end — running it the same way a real user would, with real inputs — then scoring the outputs using evaluators and producing pass/fail results via pixie test.
What you're testing is the app itself — its request handling, context assembly (how it gathers data, builds prompts, manages conversation state), routing, and response formatting. The app uses an LLM, which makes outputs non-deterministic — that's why you use evaluators (LLM-as-judge, similarity scores) instead of assertEqual — but the thing under test is the app's code, not the LLM.
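For example, a similarity evaluator can pass a paraphrased answer that assertEqual would reject. The sketch below is a minimal stand-in using difflib from the standard library; real pipelines typically use embedding similarity or an LLM-as-judge, and the function names and the 0.8 threshold are illustrative assumptions, not part of any pixie API.

```python
from difflib import SequenceMatcher

def similarity_score(expected: str, actual: str) -> float:
    """Rough lexical similarity in [0, 1] between two strings."""
    return SequenceMatcher(None, expected.lower(), actual.lower()).ratio()

def passes(expected: str, actual: str, threshold: float = 0.8) -> bool:
    """Pass/fail for a non-deterministic output against a reference.

    Unlike assertEqual, an equivalent but differently worded answer
    can still pass, as long as it stays close to the reference.
    """
    return similarity_score(expected, actual) >= threshold

# Illustrative check: the app's wording differs from the reference,
# but the evaluator still scores it as a pass.
reference = "The refund was processed on March 3."
app_output = "Your refund was processed on March 3rd."
print(similarity_score(reference, app_output))  # ~0.88, above the 0.8 threshold
print(passes(reference, app_output))            # True
```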
During evaluation, the app's own code runs for real — routing, prompt assembly, LLM calls, response formatting — nothing is mocked or stubbed. But the data the app reads from external sources (databases, caches, third-party APIs, voice streams) is replaced with test-specified values via instrumentations. This means each test case controls exactly what data the app sees, while still exercising the full application code path.
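A rough sketch of that pattern follows; unittest.mock here stands in for whatever instrumentation mechanism the project actually provides, and app.data_sources, fetch_account, and handle_request are hypothetical names, not a real app's API.

```python
from unittest.mock import patch

import app  # hypothetical application module under test

# Test-specified data the app will "read" from its account store.
TEST_ACCOUNT = {"id": "acct_42", "plan": "pro", "balance": 130.00}

def run_case(user_input: str) -> str:
    # Replace only the external data read. Routing, prompt assembly,
    # the real LLM call, and response formatting all run unmodified.
    with patch("app.data_sources.fetch_account", return_value=TEST_ACCOUNT):
        return app.handle_request(user_input)

response = run_case("What's my current balance?")
```

Each case in the dataset can swap in its own test data, so the scenario is fully controlled without touching the application logic.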
Rule: The app's LLM calls must go to a real LLM. Do not replace, mock, stub, or intercept the LLM with a fake implementation. The LLM is the core value-generating component — replacing it makes the eval tautological (you control both inputs and outputs, so scores are meaningless). If the project's test suite contains LLM mocking patterns, those are for the project's own unit tests — do NOT adopt them for the eval Runnable.
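To see why, consider the degenerate case: if the eval injects a fake LLM that returns the reference answer, the score is fixed before the app runs and measures nothing. All names below are hypothetical.

```python
EXPECTED = "Your plan renews on June 1."

def fake_llm(prompt: str) -> str:
    # The test author controls this return value...
    return EXPECTED

def handle_request_with(llm, user_input: str) -> str:
    # ...so no matter what the app's routing or prompt assembly
    # does here, the output below equals EXPECTED by construction.
    prompt = f"User asked: {user_input}"
    return llm(prompt)

assert handle_request_with(fake_llm, "When does my plan renew?") == EXPECTED
# The "eval" passes trivially: it verified the stub, not the app.
```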
The deliverable is a working pixie test run with real scores — not a plan, not just instrumentation, not just a dataset.
This skill is about doing the work, not describing it. Read code, edit files, run commands, produce a working pipeline.
Before you start
First, identify the correct virtual environment for the project and activate it. Once the virtual environment is active, run the setup.sh included in the skill's resources.