bmad-eval-runner

Fail

Audited by Gen Agent Trust Hub on May 14, 2026

Risk Level: HIGH (CREDENTIALS_UNSAFE, COMMAND_EXECUTION, DATA_EXFILTRATION)
Full Analysis
  • [CREDENTIALS_UNSAFE]: The skill extracts Claude Code OAuth credentials from the macOS Keychain.
  • Evidence: scripts/utils.py contains the function read_macos_keychain_credentials which executes security find-generic-password -s "Claude Code-credentials" -w to retrieve session tokens.
  • Evidence: These credentials are then written to a plain-text .credentials.json file within the evaluation workspace in scripts/run_evals.py and scripts/run_triggers.py.
  • [DATA_EXFILTRATION]: The skill implements a policy where run artifacts are never deleted, leading to long-term exposure of sensitive data.
  • Evidence: SKILL.md states 'Artifacts are forever. Never delete, overwrite, or rotate run folders.' This means the extracted OAuth tokens remain in the ~/bmad-evals/ directory indefinitely in plain text.
  • [COMMAND_EXECUTION]: The skill makes extensive use of the subprocess module and PTYs to execute shell commands on the host and inside containers.
  • Evidence: scripts/pty_runner.py uses pty.openpty() to simulate an interactive terminal for claude, allowing it to capture output while bypassing certain interactive restrictions.
  • Evidence: Numerous subprocess.run and subprocess.Popen calls are used for Docker management, rsync operations, and CLI execution across scripts/run_evals.py, scripts/docker_setup.py, and scripts/utils.py.
  • [REMOTE_CODE_EXECUTION]: The skill dynamically generates shell scripts as strings and executes them inside Docker containers using bash -c.
  • Evidence: In scripts/run_evals.py, the container_script variable is a multi-line shell script that incorporates variables like SKILL_NAME and SKILL_SRC through string interpolation. If a tested skill has a malicious name or path, it could lead to command injection within the container context.
  • [PROMPT_INJECTION]: The skill acts as an intermediary, passing prompts from evals.json directly into an isolated Claude instance.
  • Evidence: This creates a surface for Indirect Prompt Injection (Category 8) where a malicious evaluation file could attempt to compromise the evaluator or the host environment if the isolation layers (Docker/Local) were bypassed.
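The command injection risk described under REMOTE_CODE_EXECUTION can be illustrated with a minimal sketch. It is not the code from scripts/run_evals.py: the function names, the cp command, and the /work path are hypothetical and chosen only for illustration. It contrasts naive string interpolation with quoting each value through the standard library's shlex.quote before it is placed in a script passed to bash -c.

```python
import shlex


def build_container_script_unsafe(skill_name: str, skill_src: str) -> str:
    # Hypothetical helper: naive interpolation. Shell metacharacters in
    # either value become part of the generated script.
    return f"cp -r {skill_src} /work/{skill_name}"


def build_container_script_quoted(skill_name: str, skill_src: str) -> str:
    # Hypothetical helper: shlex.quote wraps each value so the shell
    # treats it as a single literal argument.
    dest = shlex.quote(f"/work/{skill_name}")
    return f"cp -r {shlex.quote(skill_src)} {dest}"


# A skill name carrying an extra shell command (illustrative input).
malicious_name = "demo; touch /tmp/pwned"

unsafe = build_container_script_unsafe(malicious_name, "/skills/demo")
quoted = build_container_script_quoted(malicious_name, "/skills/demo")

print(unsafe)  # the ';' is unquoted, so the shell would start a second command
print(quoted)  # the whole destination path stays one quoted argument
```

Quoting covers only values substituted into shell text. Passing an argument list to subprocess.run without a shell avoids the string-building step entirely, where the surrounding design allows it.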
Recommendations
  • AI detected serious security threats
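As one possible way to reduce the plain-text credential exposure noted under CREDENTIALS_UNSAFE and DATA_EXFILTRATION, the following sketch creates the credentials file with owner-only permissions and removes it once a run finishes. It is an assumption-laden illustration, not the skill's actual code: the helper names and the credential payload are hypothetical, and removing the file would be a deliberate exception to the "Artifacts are forever" policy quoted from SKILL.md.

```python
import json
import os
import tempfile


def write_credentials_restricted(workspace: str, credentials: dict) -> str:
    # Hypothetical helper: create .credentials.json with mode 0o600 so only
    # the owning user can read it. O_EXCL refuses to reuse an existing file.
    path = os.path.join(workspace, ".credentials.json")
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o600)
    with os.fdopen(fd, "w") as fh:
        json.dump(credentials, fh)
    return path


def remove_credentials(path: str) -> None:
    # Hypothetical helper: delete the token file after the run so it does
    # not persist alongside the other run artifacts.
    try:
        os.remove(path)
    except FileNotFoundError:
        pass


# Usage with a throwaway workspace and a placeholder token.
with tempfile.TemporaryDirectory() as workspace:
    cred_path = write_credentials_restricted(workspace, {"token": "placeholder"})
    try:
        pass  # the evaluation run would happen here
    finally:
        remove_credentials(cred_path)
```

File permissions limit which local users can read the token but do not protect it from processes running as the same user, so this narrows the exposure rather than eliminating it.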
Audit Metadata
Risk Level: HIGH
Analyzed: May 14, 2026, 11:24 AM