eval-agent-md — Behavioral Compliance Testing
What This Does
- Reads a CLAUDE.md (or agent .md file)
- Auto-generates behavioral test scenarios for each rule it finds
- Optionally generates integration scenarios that test multiple rules interacting (`--holistic`)
- Runs each scenario via `claude -p` with LLM-as-judge scoring
- Reports a compliance score with per-rule (and integration) pass/fail breakdown
- Optionally runs an automated mutation loop to improve failing rules
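The per-rule breakdown described above can be sketched as a simple aggregation over judged scenario results. This is an illustrative sketch, not the tool's actual internals; `ScenarioResult` and `compliance_score` are hypothetical names.

```python
from dataclasses import dataclass

@dataclass
class ScenarioResult:
    rule: str      # the CLAUDE.md rule this scenario exercises
    passed: bool   # LLM-as-judge verdict for one scenario run

def compliance_score(results: list[ScenarioResult]) -> tuple[float, dict[str, float]]:
    """Return the overall pass rate plus a per-rule pass-rate breakdown."""
    by_rule: dict[str, list[bool]] = {}
    for r in results:
        by_rule.setdefault(r.rule, []).append(r.passed)
    breakdown = {rule: sum(passes) / len(passes) for rule, passes in by_rule.items()}
    overall = sum(r.passed for r in results) / len(results) if results else 0.0
    return overall, breakdown
```

A rule fails the report when its pass rate drops below whatever threshold the mutation loop targets.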
Workflow
Script Execution
Always run scripts with `uv run --script` — never `python`, never `python3`, never a bare script name. The scripts declare their own dependencies via inline `# /// script` metadata; `uv run --script` resolves all dependencies automatically — no `pip install` required, ever. Invoking with `python` or `python3` will fail with import errors because the dependencies are not installed in the system environment.
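The inline metadata block uv reads follows PEP 723 and sits at the top of each script. A minimal sketch (the dependency list here is illustrative, not what the actual scripts declare):

```python
# /// script
# requires-python = ">=3.11"
# dependencies = ["rich"]
# ///
# uv reads the block above and resolves "rich" into an isolated
# environment before executing the script — the system Python is
# never consulted for these imports.
from rich import print

print("[green]dependencies resolved by uv[/green]")
```

Running this as `uv run --script example.py` works on a machine with no packages installed; running it as `python example.py` raises `ModuleNotFoundError` unless `rich` happens to be installed globally.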
Progress Reporting