The Agent Skills Directory

[COMMAND_EXECUTION]: The skill uses the subprocess.run function in scripts/evaluate.py, scripts/task_runner.py, and interfaces/judges.py to execute external commands. This includes calling the claude CLI for task execution and the pytest framework for output verification. These operations are core to the skill's purpose as a benchmarking utility.
[PROMPT_INJECTION]: The skill possesses an indirect prompt injection surface because it ingests untrusted data from YAML-formatted task suites to construct LLM prompts.
Ingestion points: Prompt text and evaluation rubrics are loaded from task suite files located in the task_suites/ directory.
Boundary markers: In scripts/task_runner.py, skill content is delimited with ---BEGIN SKILL.MD--- and ---END SKILL.MD--- markers before the task prompt is appended.
Capability inventory: The skill has the capability to execute shell commands and file system operations via its task runner and judge interfaces.
Sanitization: The PytestJudge in interfaces/judges.py implements path traversal checks using Path.resolve() to ensure that test execution is strictly confined to the tests/fixtures/ directory, mitigating the risk of executing arbitrary system files.

improvement-evaluator