The Agent Skills Directory

[COMMAND_EXECUTION]: The skill utilizes shell commands and local scripts to aggregate session data for evaluation.
Evidence: Documented use of cat, jq, and npx tsx eval.ts to process session JSON files and execution traces.
[DATA_EXFILTRATION]: Execution traces and task data are transmitted to external LLM providers (OpenAI, Azure) for scoring purposes. This is a standard operation for this type of tool but involves external data sharing.
Evidence: curl commands targeting api.openai.com and Azure endpoint configurations in agent-eval.yaml.
[PROMPT_INJECTION]: The skill processes untrusted agent outputs and execution traces within an evaluation template, creating a surface for indirect prompt injection.
Ingestion points: TASK, TRACE, and OUTPUT variables extracted from .reflection/session_*.json and shell environment.
Boundary markers: None present in the evaluation prompt template to distinguish untrusted content.
Capability inventory: Network access via curl and file access via cat.
Sanitization: No sanitization or validation of the agent-generated trace content is implemented before it is sent to the LLM judge.

agent-evaluation