evaluation
Pass
Audited by Gen Agent Trust Hub on May 2, 2026
Risk Level: SAFE
Full Analysis
- [SAFE]: The skill consists of documentation and templates for LLM evaluation. It contains no malicious instructions or hidden payloads.
- [COMMAND_EXECUTION]: The main SKILL.md provides a workflow involving a bash command to run an evaluation script (python evaluate.py). This is a standard developer workflow for running automated tests.
- [INDIRECT_PROMPT_INJECTION]: The skill provides a surface for processing untrusted data (LLM outputs) during the evaluation process.
  - Ingestion points: The system reads test cases and model outputs for evaluation.
  - Boundary markers: The provided judge prompts use markdown headers (e.g., ## Original Input) as delimiters.
  - Capability inventory: The skill uses Read, Write, and Bash tools for managing test suites and logging results.
  - Sanitization: Implementation examples demonstrate the use of Pydantic and Zod for strict schema validation and structured parsing of judge outputs.
- [EXTERNAL_DOWNLOADS]: The implementation guides reference well-known libraries (the Anthropic SDK, Pydantic, and Zod) that are standard industry tools for building reliable AI applications.
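The boundary-marker and sanitization findings above can be illustrated with a minimal sketch. It is not taken from the audited skill: the `JudgeVerdict` schema, its field names, the 1 to 5 score range, the `## Model Output` header, and both helper functions are hypothetical, and the code assumes Pydantic v2 is installed.

```python
from typing import Optional

from pydantic import BaseModel, Field, ValidationError


class JudgeVerdict(BaseModel):
    """Hypothetical schema for a judge's structured reply (an assumption)."""

    score: int = Field(ge=1, le=5)        # assumed 1-5 rating scale
    reasoning: str = Field(min_length=1)  # non-empty justification


def build_judge_prompt(original_input: str, model_output: str) -> str:
    """Wrap untrusted text in markdown-header delimiters.

    "## Original Input" comes from the audit text; "## Model Output"
    is an assumed companion header.
    """
    return (
        "Rate the response from 1 to 5 and reply with JSON only.\n\n"
        f"## Original Input\n{original_input}\n\n"
        f"## Model Output\n{model_output}\n"
    )


def parse_verdict(raw: str) -> Optional[JudgeVerdict]:
    """Return a validated verdict, or None if the reply fails the schema."""
    try:
        return JudgeVerdict.model_validate_json(raw)
    except ValidationError:
        # Malformed JSON, missing fields, and out-of-range scores all land here.
        return None
```

With this sketch, a reply such as `{"score": 9, "reasoning": "ok"}` or free text that is not JSON is rejected rather than logged as a result. Note that header delimiters alone do not stop a model output from containing its own headers, so schema validation of the judge reply carries most of the weight here.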
Audit Metadata