spec-tests
Pass
Audited by Gen Agent Trust Hub on Mar 29, 2026
Risk Level: SAFE
Findings: COMMAND_EXECUTION, DATA_EXFILTRATION, PROMPT_INJECTION
Full Analysis
- [COMMAND_EXECUTION]: The skill's primary functionality relies on executing external CLI tools through shell commands.
  - Evidence: The Python scripts `run_tests_claude.py`, `run_tests_opencode.py`, and `run_tests_codex.py` use `subprocess.run()` to call `claude`, `opencode`, and `codex` respectively.
  - Evidence: The PowerShell script `Invoke-SpecTests.ps1` executes the `copilot` CLI tool.
  - Risk: While the tools themselves are well known, executing commands from scripts provided within a skill carries inherent risk if the environment is not restricted.
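For illustration, the invocation pattern described above can be sketched as follows. This is a minimal, hypothetical simplification — the actual runner scripts, their flags, and their prompt handling are not reproduced here:

```python
import subprocess

def run_judge(argv: list[str], timeout: int = 300) -> str:
    """Invoke an LLM CLI (e.g. `claude` or `codex`) as a subprocess and
    return its stdout.

    Simplified sketch of what the run_tests_*.py scripts do: the command
    is passed as an argv list (no shell interpretation) and output is
    captured for later verdict parsing.
    """
    result = subprocess.run(
        argv,
        capture_output=True,  # collect stdout for the verdict parser
        text=True,            # decode bytes to str
        timeout=timeout,      # bound how long the CLI may run
        check=True,           # raise if the CLI exits non-zero
    )
    return result.stdout

if __name__ == "__main__":
    # Using `echo` as a stand-in CLI so the sketch runs anywhere.
    print(run_judge(["echo", "judge prompt here"]))
```

Even with the argv-list form avoiding shell injection, the risk noted above remains: whatever binary name the skill's scripts choose is executed with the user's privileges.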
- [DATA_EXFILTRATION]: The skill reads local files and sends their contents to external LLM services for evaluation.
  - Evidence: The `judge_prompt.md` instructs the LLM to use a "Read tool" to access target files specified in the test frontmatter. The Python and PowerShell runners facilitate this by passing file paths or content to the LLM CLIs.
  - Risk: If a specification file points to sensitive local files (e.g., `.env`, SSH keys), those files will be read and their contents processed by external LLM providers.
- [PROMPT_INJECTION]: The skill uses complex system prompts and user-provided specifications to steer LLM behavior.
  - Evidence: `judge_prompt.md` contains strict behavioral directives such as "CRITICAL: You must respond with ONLY a JSON object" and "No other text... Do not wrap the JSON in backticks."
  - Risk: Maliciously crafted specifications could attempt to override these instructions to extract system prompts or bypass evaluation logic.
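The data flow behind these two findings can be sketched as follows. The frontmatter key name (`targets`), the naive parsing, and the prompt layout are all assumptions for illustration, not the skill's actual implementation:

```python
from pathlib import Path

# A hypothetical spec file: YAML frontmatter naming target files,
# followed by the judge's assertion block.
SPEC = """\
---
targets:
  - ./app/auth.py
---
BEGIN_ASSERTION
Passwords are hashed before storage.
END_ASSERTION
"""

def parse_targets(spec_text: str) -> list[str]:
    """Naive scan for `targets:` entries in the frontmatter.
    Illustrative only; real runners would use a YAML parser."""
    targets, in_targets = [], False
    for line in spec_text.splitlines():
        if line.strip() == "targets:":
            in_targets = True
        elif in_targets and line.strip().startswith("- "):
            targets.append(line.strip()[2:])
        elif in_targets:
            break
    return targets

def build_judge_prompt(spec_text: str,
                       read=lambda p: Path(p).read_text()) -> str:
    """Inline each target file's content into the judge prompt verbatim.
    Nothing prevents a target from being `.env` or an SSH key, and the
    content is not escaped before the external LLM sees it."""
    parts = [spec_text]
    for t in parse_targets(spec_text):
        parts.append(f"--- file: {t} ---\n{read(t)}")
    return "\n".join(parts)
```

Because the target file's bytes land inside the same prompt as the judge's directives, any secret in the file leaves the machine, and any instruction in the file competes with the judge's own.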
- [INDIRECT_PROMPT_INJECTION]: The skill is vulnerable to indirect prompt injection from the files it evaluates.
  - Ingestion points: The target files specified in the YAML frontmatter of spec files (e.g., in `specs/tests/authentication.md`) are read and analyzed by the LLM judge.
  - Boundary markers: The `judge_prompt.md` uses `BEGIN_ASSERTION` and `END_ASSERTION` blocks to delimit test conditions, but the target file content itself is not strictly isolated or sanitized.
  - Capability inventory: The runners can read any local file accessible to the user and perform network operations via the LLM CLI tools.
  - Sanitization: There is no evidence of sanitization or escaping of the target file content before it is processed by the LLM judge.
  - Risk: A target file could contain instructions designed to trick the LLM judge into reporting a "PASS" verdict regardless of actual implementation quality.
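The missing sanitization step could take a shape like the following. This is an illustrative mitigation, not part of the audited skill — the marker names come from `judge_prompt.md`; the fencing scheme is an assumption:

```python
import secrets

def wrap_untrusted(content: str) -> str:
    """Fence untrusted file content so it cannot close or forge the
    judge's BEGIN_ASSERTION/END_ASSERTION blocks.

    Illustrative mitigation only; the audited skill performs no such step.
    """
    # Neutralize the judge's own markers if they appear in the file.
    for marker in ("BEGIN_ASSERTION", "END_ASSERTION"):
        content = content.replace(marker, marker.replace("_", r"\_"))
    # Unpredictable, single-use delimiter the file cannot guess in advance.
    tag = secrets.token_hex(8)
    return f"<untrusted-{tag}>\n{content}\n</untrusted-{tag}>"
```

A random fence plus marker escaping raises the bar for a target file that tries to smuggle a fake "PASS" verdict into the judge's context, though it does not eliminate the risk — instruction-following models can still be steered by plain prose inside the fence.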
Audit Metadata