caw-eval

Fail

Audited by Gen Agent Trust Hub on Apr 30, 2026

Risk Level: HIGHCOMMAND_EXECUTIONREMOTE_CODE_EXECUTIONEXTERNAL_DOWNLOADSDATA_EXFILTRATIONPROMPT_INJECTION
Full Analysis
  • [COMMAND_EXECUTION]: Orchestrator scripts run_eval_cc.py and run_eval_openclaw.py extensively use asyncio.create_subprocess_exec to run local and remote shell commands via gcloud compute ssh to manage the evaluation environment.
  • [COMMAND_EXECUTION]: The script run_eval_openclaw.py uses sudo systemctl restart to dynamically update environment variables for the gateway service on remote servers.
  • [COMMAND_EXECUTION]: The documentation in server-setup.md and references/common-execution.md instructs users to modify ~/.bashrc and ~/.ssh/config to persist environment settings and SSH tunnel configurations.
  • [REMOTE_CODE_EXECUTION]: server-setup.md provides instructions to download and execute setup scripts from well-known sources (NodeSource) and vendor repositories (raw.githubusercontent.com/CoboGlobal) via shell piping (curl | bash).
  • [EXTERNAL_DOWNLOADS]: Fetches binaries and dependencies from well-known repositories such as the NodeSource registry and the official CoboGlobal GitHub organization.
  • [DATA_EXFILTRATION]: Orchestrator scripts automatically retrieve session logs and wallet transaction specifications from remote instances using gcloud compute scp for local analysis.
  • [PROMPT_INJECTION]:
  • Ingestion points: score_traces.py ingests untrusted agent session data from Langfuse traces or local JSONL files.
  • Boundary markers: The judge_cc.py script embeds these logs into judge prompts using headers like [USER] and [ASSISTANT], but does not fully escape or encapsulate the untrusted content.
  • Capability inventory: The orchestration scripts possess high-privilege capabilities including arbitrary shell command execution and remote service management.
  • Sanitization: The logs undergo truncation for length in _build_session_text_from_observations but lack robust sanitization against embedded instructions targeting the LLM-as-Judge subagent.
Recommendations
  • HIGH: Downloads and executes remote code from: http://metadata.google.internal/computeMetadata/v1 - DO NOT USE without thorough review
Audit Metadata
Risk Level
HIGH
Analyzed
Apr 30, 2026, 02:21 AM