bmad-eval-runner
Fail
Audited by Gen Agent Trust Hub on May 14, 2026
Risk Level: HIGH (CREDENTIALS_UNSAFE, COMMAND_EXECUTION, DATA_EXFILTRATION)
Full Analysis
- [CREDENTIALS_UNSAFE]: The skill extracts Claude Code OAuth credentials from the macOS Keychain.
  - Evidence: `scripts/utils.py` contains the function `read_macos_keychain_credentials`, which executes `security find-generic-password -s "Claude Code-credentials" -w` to retrieve session tokens.
  - Evidence: These credentials are then written to a plain-text `.credentials.json` file within the evaluation workspace in `scripts/run_evals.py` and `scripts/run_triggers.py`.
- [DATA_EXFILTRATION]: The skill implements a policy where run artifacts are never deleted, leading to long-term exposure of sensitive data.
  - Evidence: `SKILL.md` states 'Artifacts are forever. Never delete, overwrite, or rotate run folders.' This means the extracted OAuth tokens remain in the `~/bmad-evals/` directory indefinitely in plain text.
- [COMMAND_EXECUTION]: The skill makes extensive use of the `subprocess` module and PTYs to execute shell commands on the host and inside containers.
  - Evidence: `scripts/pty_runner.py` uses `pty.openpty()` to simulate an interactive terminal for `claude`, allowing it to capture output while bypassing certain interactive restrictions.
  - Evidence: Numerous `subprocess.run` and `subprocess.Popen` calls are used for Docker management, `rsync` operations, and CLI execution across `scripts/run_evals.py`, `scripts/docker_setup.py`, and `scripts/utils.py`.
- [REMOTE_CODE_EXECUTION]: The skill dynamically generates shell scripts as strings and executes them inside Docker containers using `bash -c`.
  - Evidence: In `scripts/run_evals.py`, the `container_script` variable is a multi-line shell script that incorporates variables like `SKILL_NAME` and `SKILL_SRC` through string interpolation. If a tested skill has a malicious name or path, it could lead to command injection within the container context.
- [PROMPT_INJECTION]: The skill acts as an intermediary, passing prompts from `evals.json` directly into an isolated Claude instance.
  - Evidence: This creates a surface for Indirect Prompt Injection (Category 8), where a malicious evaluation file could attempt to compromise the evaluator or the host environment if the isolation layers (Docker/Local) were bypassed.
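
The REMOTE_CODE_EXECUTION finding above can be illustrated with a minimal sketch. The report does not reproduce the actual `container_script`, so the `cp` command, the `/workspace/skills/` path, and both function names below are hypothetical stand-ins. The sketch only shows how raw interpolation of a skill name into a shell string differs from quoting each value with the standard library's `shlex.quote`, which is one common mitigation for POSIX shells and not necessarily the fix the skill's authors would choose.

```python
import shlex


def build_script_interpolated(skill_name: str, skill_src: str) -> str:
    # Hypothetical stand-in for the risky pattern: values are pasted
    # into the shell string with no quoting at all.
    return f"cp -r {skill_src} /workspace/skills/{skill_name}"


def build_script_quoted(skill_name: str, skill_src: str) -> str:
    # shlex.quote wraps each value so a POSIX shell reads it as a
    # single literal word, even if it contains ';' or spaces.
    dest = f"/workspace/skills/{skill_name}"
    return f"cp -r {shlex.quote(skill_src)} {shlex.quote(dest)}"


# A skill name chosen to smuggle in a second command (illustrative only).
malicious_name = "demo; echo INJECTED"

unsafe = build_script_interpolated(malicious_name, "/src/demo")
quoted = build_script_quoted(malicious_name, "/src/demo")

# shlex.split approximates POSIX word splitting, so it shows how many
# words each script breaks into without running anything.
unsafe_words = shlex.split(unsafe)
quoted_words = shlex.split(quoted)

print(unsafe)
print(quoted)
print(len(unsafe_words), len(quoted_words))
```

In the interpolated version the `;` reaches the shell unquoted, so `bash -c` would treat `echo INJECTED` as a separate command. In the quoted version the whole name stays inside one destination argument. Passing values as separate argv entries, rather than building a script string, would avoid the shell parsing step altogether.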
Recommendations
- AI detected serious security threats
Audit Metadata