run-evals

Warn

Audited by Gen Agent Trust Hub on Apr 6, 2026

Risk Level: MEDIUM
REMOTE_CODE_EXECUTION, CREDENTIALS_UNSAFE, EXTERNAL_DOWNLOADS, PROMPT_INJECTION
Full Analysis
  • [REMOTE_CODE_EXECUTION]: Example 3 in references/end-to-end-examples.md uses exec(row.code) and eval(row.test_expression) to evaluate LLM-generated code. This pattern executes arbitrary code provided in the dataset, which could lead to system compromise if the dataset is maliciously crafted.
  • [CREDENTIALS_UNSAFE]: Step 10 in SKILL.md instructs users to clone Git repositories using URLs that contain API keys (e.g., https://user:YOUR_API_KEY@<git-url>). This practice is insecure as it exposes sensitive credentials in shell history, process listings, and Git configuration files.
  • [EXTERNAL_DOWNLOADS]: The skill requires the installation of the zeroeval Python package from external registries and fetches datasets from remote servers via ze.Dataset.pull().
  • [PROMPT_INJECTION]: The skill's workflow involves ingesting untrusted data from external datasets via ze.Dataset.pull(). This data is then interpolated into prompts and directly executed in evaluation scripts without proper isolation or sanitization.
      • Ingestion points: Dataset rows are pulled in SKILL.md and references/end-to-end-examples.md via the SDK.
      • Boundary markers: None used when interpolating row fields into LLM messages or code execution blocks.
      • Capability inventory: The skill possesses capabilities for arbitrary code execution (exec, eval) and network communication via the SDK and OpenAI API.
      • Sanitization: No validation or sanitization of dataset content is performed before use in execution or prompting.
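
The isolation and boundary-marker gaps noted above can be illustrated with a minimal Python sketch. It is not taken from the skill or the zeroeval SDK: the helper names wrap_untrusted and run_isolated, the <untrusted> marker format, and the MAX_FIELD_CHARS limit are all hypothetical. The sketch delimits a dataset field before it is interpolated into a prompt, and runs dataset-provided code in a separate interpreter with a timeout instead of calling exec() in-process. A child process is not a full sandbox, so it only narrows the exposure described in the findings.

```python
import subprocess
import sys

# Hypothetical cap on how much of an untrusted field reaches the prompt.
MAX_FIELD_CHARS = 4000


def wrap_untrusted(field_name: str, value: object) -> str:
    """Delimit an untrusted dataset field so it reads as data, not instructions."""
    text = str(value)[:MAX_FIELD_CHARS]
    # Break up anything in the field that imitates our closing marker.
    text = text.replace("</untrusted", "<\\/untrusted")
    return f'<untrusted field="{field_name}">\n{text}\n</untrusted>'


def run_isolated(code: str, timeout_s: float = 5.0) -> subprocess.CompletedProcess:
    """Run dataset-provided code in a child interpreter instead of exec().

    -I starts Python in isolated mode (ignores environment variables and the
    user site directory). This is NOT a sandbox: the child can still touch the
    filesystem and network, so pair it with OS-level isolation in practice.
    Raises subprocess.TimeoutExpired if the code runs longer than timeout_s.
    """
    return subprocess.run(
        [sys.executable, "-I", "-c", code],
        capture_output=True,
        text=True,
        timeout=timeout_s,
    )
```

A caller would pass each row field through wrap_untrusted before building the LLM message, and inspect the returncode and stdout of run_isolated rather than trusting in-process side effects.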
Audit Metadata
Risk Level: MEDIUM
Analyzed: Apr 6, 2026, 07:22 PM