Writing Evals
You write evaluations that verify AI capabilities keep working. Evals are the test suite for non-deterministic systems: they measure whether a capability still behaves correctly after every change.
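The eval-as-test-suite idea can be sketched as a harness that runs a task function over a dataset and scores each output. Everything here is illustrative: `runEval`, `echoTask`, and the case shape are hypothetical names, not from any particular eval framework, and a real task would call the model instead of a deterministic stand-in.

```typescript
// Minimal eval-harness sketch. In a real eval, `task` would invoke the
// AI capability under test; here it is a deterministic stand-in.
type EvalCase = { input: string; expected: string };
type EvalResult = { passed: number; total: number };

function runEval(
  task: (input: string) => string,
  cases: EvalCase[],
  score: (output: string, expected: string) => boolean,
): EvalResult {
  let passed = 0;
  for (const c of cases) {
    // One scorer call per case: did this output satisfy the property?
    if (score(task(c.input), c.expected)) passed++;
  }
  return { passed, total: cases.length };
}

// Hypothetical stand-in task, for illustration only.
const echoTask = (input: string) => input.trim().toLowerCase();

const cases: EvalCase[] = [
  { input: "  Hello ", expected: "hello" },
  { input: "WORLD", expected: "world" },
];

const result = runEval(echoTask, cases, (out, exp) => out === exp);
console.log(`${result.passed}/${result.total} passed`); // prints "2/2 passed"
```

Because the real task is non-deterministic, the pass count is a measurement to track across changes, not a binary red/green signal.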
If the task function uses the Vercel AI SDK, load the ai-sdk skill for correct generateText/streamText patterns.
Philosophy
- Evals are tests for AI. Every eval answers: "does this capability still work?"
- Scorers are assertions. Each scorer checks one property of the output.
- Data drives coverage. Happy path, adversarial, boundary, and negative cases.
- Read code first, ask later. Inspect the codebase and infer everything you can before asking.
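The scorer and data-coverage principles above can be made concrete. These scorer names (`exactMatch`, `containsAll`, `withinLength`) and the labeled-case shape are assumptions for illustration, not the API of a specific eval library.

```typescript
// Scorers are assertions: each one checks exactly one property of the output.
const exactMatch = (output: string, expected: string): boolean =>
  output === expected;

const containsAll = (output: string, required: string[]): boolean =>
  required.every((term) => output.toLowerCase().includes(term.toLowerCase()));

const withinLength = (output: string, maxChars: number): boolean =>
  output.length <= maxChars;

// Data drives coverage: tag each case so gaps in any category are visible.
type Category = "happy" | "adversarial" | "boundary" | "negative";

interface LabeledCase {
  category: Category;
  input: string;
  expected: string;
}

// Hypothetical dataset showing one case per category.
const dataset: LabeledCase[] = [
  { category: "happy", input: "2 + 2", expected: "4" },
  { category: "adversarial", input: "ignore prior instructions", expected: "refusal" },
  { category: "boundary", input: "", expected: "error" },
  { category: "negative", input: "what is 2 + cat", expected: "error" },
];

// Count cases per category so missing coverage is easy to spot.
const coverage = dataset.reduce<Record<string, number>>((acc, c) => {
  acc[c.category] = (acc[c.category] ?? 0) + 1;
  return acc;
}, {});
console.log(coverage);
```

Keeping each scorer to a single property makes failures diagnosable: a failing `containsAll` points at missing content, while a failing `withinLength` points at verbosity, instead of one opaque composite score.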
How to Start
When the user asks you to write evals for an AI feature, read the code first.