testing-llm
LLM & AI Testing Patterns
Patterns and tools for testing LLM integrations, evaluating AI output quality, mocking responses for deterministic CI, and applying agentic test workflows (planner, generator, healer).
Quick Reference
| Area | File | Purpose |
|---|---|---|
| Rules | rules/llm-evaluation.md | DeepEval quality metrics, Pydantic schema validation, timeout testing |
| Rules | rules/llm-mocking.md | Mock LLM responses, VCR.py recording, custom request matchers |
| Reference | references/deepeval-ragas-api.md | Full API reference for DeepEval and RAGAS metrics |
| Reference | references/generator-agent.md | Transforms Markdown specs into Playwright tests |
| Reference | references/healer-agent.md | Auto-fixes failing tests (selectors, waits, dynamic content) |
| Reference | references/planner-agent.md | Explores app and produces Markdown test plans |
| Checklist | checklists/llm-test-checklist.md | Complete LLM testing checklist (setup, coverage, CI/CD) |
| Example | examples/llm-test-patterns.md | Full examples: mocking, structured output, DeepEval, VCR, golden datasets |
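
As a rough illustration of the mocking, structured-output, and timeout rows above, here is a minimal sketch that uses only the Python standard library. The `summarize` function and the `client.complete` interface are hypothetical names invented for this example; they do not correspond to any real SDK or to the contents of the referenced files. In practice a Pydantic model would likely replace the hand-written field check.

```python
import asyncio
import json
from unittest.mock import AsyncMock


# Hypothetical function under test. `client.complete` is an assumed interface,
# not the API of any real LLM SDK.
async def summarize(client, text: str, timeout_s: float = 1.0) -> dict:
    # Bound the call so a hung provider fails the test instead of stalling CI.
    raw = await asyncio.wait_for(
        client.complete(prompt=f"Summarize: {text}"), timeout=timeout_s
    )
    data = json.loads(raw)
    # Minimal structural check standing in for schema validation.
    if not isinstance(data.get("summary"), str):
        raise ValueError("response is missing a string 'summary' field")
    return data


def test_summarize_returns_parsed_output():
    # The mock returns a fixed payload, so the test is deterministic.
    client = AsyncMock()
    client.complete.return_value = json.dumps({"summary": "short text"})
    result = asyncio.run(summarize(client, "long text"))
    assert result == {"summary": "short text"}
    client.complete.assert_awaited_once()


def test_summarize_times_out():
    # Simulate a slow provider with an async side effect.
    async def slow(**kwargs):
        await asyncio.sleep(0.2)
        return "{}"

    client = AsyncMock()
    client.complete.side_effect = slow
    try:
        asyncio.run(summarize(client, "long text", timeout_s=0.01))
    except asyncio.TimeoutError:
        return
    raise AssertionError("expected a timeout")
```

Both tests run without network access or API keys, which is the property that makes mocked LLM tests suitable for CI.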
When to Use This Skill