The Agent Skills Directory

[COMMAND_EXECUTION]: The skill unsets CLAUDECODE and CLAUDE_CODE_ENTRYPOINT environment variables and invokes claude -p with the --dangerously-skip-permissions flag. This configuration allows nested agent sessions to execute tools (Bash, Read, Write, etc.) autonomously without human approval, bypassing standard platform safety controls for the duration of the benchmark.
[REMOTE_CODE_EXECUTION]: The skill executes commands extracted from benchmark task files using subprocess.run in scripts/run_checks.py. Although the script implements an executable allowlist (e.g., python3, node) and filters for shell metacharacters, it remains a vector for executing arbitrary code logic defined in external task files.
[PROMPT_INJECTION]: The skill is vulnerable to indirect prompt injection (Category 8) because it ingests untrusted data from multiple sources and processes it using a subagent with tool access.
Ingestion points: Target skill's SKILL.md (Step 2), benchmark task definitions in the tasks/ directory, and session outputs in response.json.
Boundary markers: Absent. The instructions for the grader subagent in agents/grader.md do not include delimiters or instructions to ignore embedded commands or behavioral overrides within the ingested data.
Capability inventory: The parent agent and grader subagent possess Bash, Write, Edit, and Agent tool capabilities.
Sanitization: While scripts/run_checks.py validates verification commands, no sanitization or instruction filtering is applied to the natural language content processed by the subagents.

skill-benchmark