The Agent Skills Directory

[SAFE]: The skill consists of documentation and templates for LLM evaluation. It contains no malicious instructions or hidden payloads.
[COMMAND_EXECUTION]: The main SKILL.md provides a workflow involving a bash command to run an evaluation script (python evaluate.py). This is a standard developer workflow for running automated tests.
[INDIRECT_PROMPT_INJECTION]: The skill provides a surface for processing untrusted data (LLM outputs) during the evaluation process. Ingestion points: The system reads test cases and model outputs for evaluation. Boundary markers: The provided judge prompts use markdown headers (e.g., ## Original Input) as delimiters. Capability inventory: The skill uses Read, Write, and Bash tools for managing test suites and logging results. Sanitization: Implementation examples demonstrate the use of Pydantic and Zod for strict schema validation and structured parsing of judge outputs.
[EXTERNAL_DOWNLOADS]: The implementation guides reference standard, well-known libraries such as the Anthropic SDK, Pydantic, and Zod which are standard industry tools for building reliable AI applications.

evaluation