The Agent Skills Directory

[COMMAND_EXECUTION]: The eval.sh harness executes several CLI tools including claude, opencode, python3, jq, and perl. These are utilized for core functionality like running LLM inferences and data processing. While these are appropriate for the tool's purpose, they provide a mechanism for external interaction.
[PROMPT_INJECTION]: The skill exhibits an indirect prompt injection surface. Data from golden_examples.yaml is used to construct prompts for both the agent and the judge LLM without sufficient isolation.
Ingestion points: Scenario data (queries, context, behavior) enters the system via golden_examples.yaml.
Boundary markers: The prompts use basic separators like '---' but lack explicit instructions for the judge LLM to disregard potentially adversarial content within the test data.
Capability inventory: The system can execute CLI commands (claude, opencode) to perform network-based LLM inferences.
Sanitization: There is no evidence of input validation or sanitization for the strings interpolated into the LLM prompts.

skill-testing