Agent Test

Designing the measurement an AI agent or skill is judged by: evals, LLM-as-judge graders, trajectory tests, held-out benchmarks, and activation evals. The agent-actor analog of human test design. Provenance lives in skill.json; this file is runtime routing only.

Produces: a change-plan.md (DO), an audit-report.md plus a findings-ledger + workflow-state when tracked (REVIEW), or a design-doc.md / refactor-runbook.md / explanation.md (DESIGN).
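
As a reading aid, the mode-to-artifact mapping above could be sketched as a small lookup. This is only an illustration: the mode names and artifact filenames come from the line above, while the dictionary, the constant, and the function `candidate_artifacts` are hypothetical names and are not part of the skill itself.

```python
from typing import List

# Hypothetical lookup: modes and artifact names are taken from the
# "Produces" line; the structure itself is an illustrative assumption.
ARTIFACTS_BY_MODE = {
    "DO": ["change-plan.md"],
    "REVIEW": ["audit-report.md"],
    # DESIGN yields one of these documents, not all three.
    "DESIGN": ["design-doc.md", "refactor-runbook.md", "explanation.md"],
}

# Extra REVIEW outputs, produced only when the review is tracked.
TRACKED_REVIEW_EXTRAS = ["findings-ledger", "workflow-state"]


def candidate_artifacts(mode: str, tracked: bool = False) -> List[str]:
    """Return the artifacts a run in `mode` may produce."""
    key = mode.upper()
    if key not in ARTIFACTS_BY_MODE:
        raise ValueError(f"unknown mode: {mode!r}")
    artifacts = list(ARTIFACTS_BY_MODE[key])  # copy so callers cannot mutate the table
    if key == "REVIEW" and tracked:
        artifacts += TRACKED_REVIEW_EXTRAS
    return artifacts
```

For example, `candidate_artifacts("REVIEW", tracked=True)` lists the audit report together with the findings ledger and workflow state, while an untracked review lists the audit report alone.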

Boundaries

Do NOT use to operate or watch the loop these evals feed, including the eval/optimization loop, autonomy, and reliability (use agent-ops), design the SDK/tool surface (use agent-dx), write agent-native docs (use agent-docs), or scaffold repo CI gates (use harden-repo-for-coding-agents).

Core principle