eval-agent-md — Behavioral Compliance Testing
What This Does
- Reads a CLAUDE.md (or agent .md file)
- Auto-generates behavioral test scenarios for each rule it finds
- Optionally generates integration scenarios that test multiple rules interacting (`--holistic`)
- Runs each scenario via `claude -p` with LLM-as-judge scoring
- Reports a compliance score with per-rule (and integration) pass/fail breakdown
- Optionally runs an automated mutation loop to improve failing rules
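The per-rule breakdown described above can be sketched as a simple aggregation over judged scenario results. This is an illustrative sketch, not the tool's actual internals; `ScenarioResult` and `compliance_score` are hypothetical names.

```python
from dataclasses import dataclass

@dataclass
class ScenarioResult:
    rule: str      # the CLAUDE.md rule this scenario exercises
    passed: bool   # LLM-as-judge verdict for one scenario run

def compliance_score(results: list[ScenarioResult]) -> tuple[float, dict[str, float]]:
    """Return the overall pass rate plus a per-rule pass-rate breakdown."""
    by_rule: dict[str, list[bool]] = {}
    for r in results:
        by_rule.setdefault(r.rule, []).append(r.passed)
    breakdown = {rule: sum(passes) / len(passes) for rule, passes in by_rule.items()}
    overall = sum(r.passed for r in results) / len(results) if results else 0.0
    return overall, breakdown
```

A rule fails the report when its pass rate drops below whatever threshold the mutation loop targets.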
Workflow
Script Execution
Always run scripts with `uv run --script` — never `python`, never `python3`, never a bare script name. The scripts declare their own dependencies via inline `# /// script` metadata; `uv run --script` resolves all dependencies automatically — no `pip install` required, ever. Invoking with `python` or `python3` will fail with import errors because the dependencies are not installed in the system environment.
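The inline metadata block uv reads follows PEP 723 and sits at the top of each script. A minimal sketch (the dependency list here is illustrative, not what the actual scripts declare):

```python
# /// script
# requires-python = ">=3.11"
# dependencies = ["rich"]
# ///
# uv reads the block above and resolves "rich" into an isolated
# environment before executing the script — the system Python is
# never consulted for these imports.
from rich import print

print("[green]dependencies resolved by uv[/green]")
```

Running this as `uv run --script example.py` works on a machine with no packages installed; running it as `python example.py` raises `ModuleNotFoundError` unless `rich` happens to be installed globally.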
Progress Reporting