agent-evals

Pass

Audited by Gen Agent Trust Hub on Jun 18, 2026

Risk Level: SAFE
Flagged categories: COMMAND_EXECUTION, EXTERNAL_DOWNLOADS, DATA_EXFILTRATION
Full Analysis
  • [COMMAND_EXECUTION]: The evaluation scripts (evals/phase2-grader.py and evals/integration-test.sh) run shell commands to create virtual environments and install the dependencies required by the skill's test suite.
  • [COMMAND_EXECUTION]: The evals/phase2-grader.py script uses os.execv to restart its execution within a newly created virtual environment after verifying the availability of the Anthropic SDK.
  • [COMMAND_EXECUTION]: The evaluation fixture evals/fixtures/agent.py contains a subprocess.run call with shell=True. This file is explicitly documented as a mock workspace component used to test the skill's ability to diagnose and remediate un-sandboxed execution risks in user code.
  • [COMMAND_EXECUTION]: The autonomous-improve-loop.mjs template executes shell commands for running user-defined evaluations and benchmarking suites via spawnSync. These commands are configured through environment variables to provide flexibility in various CI/CD environments.
  • [COMMAND_EXECUTION]: The level-3-sandbox-harness.py template uses docker run to execute user-defined shell commands inside an isolated container, which is a recommended security control for AI agents with system access.
  • [EXTERNAL_DOWNLOADS]: The skill's test automation suite downloads and installs the anthropic and dspy-ai packages from official registries to facilitate its internal evaluation and integration testing.
  • [DATA_EXFILTRATION]: The autonomous-improve-loop.mjs template transmits allowlisted workspace file contents and execution traces to the OpenAI API for the purpose of generating improvement patches. The script includes a redaction mechanism designed to filter out API keys and secrets before data transmission.
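The redaction step described in the DATA_EXFILTRATION finding can be illustrated with a minimal sketch. The audited template is JavaScript and its actual redaction rules are not shown in this report; the function name `redact`, the placeholder string, and both regular expressions below are assumptions chosen for illustration, not the template's implementation.

```python
import re

# Hypothetical patterns: the audited script's real rules are not shown in the report.
# 1) Bare tokens that look like API keys with an "sk-" prefix.
TOKEN_PATTERN = re.compile(r"sk-[A-Za-z0-9_-]{20,}")

# 2) Assignments whose variable name suggests a secret, e.g. DB_PASSWORD = "...".
ASSIGNMENT_PATTERN = re.compile(
    r"(?i)([A-Z0-9_]*(?:api_?key|secret|token|password)[A-Z0-9_]*)(\s*[:=]\s*)(\S+)"
)

PLACEHOLDER = "[REDACTED]"


def redact(text: str) -> str:
    """Replace secret-looking values with a placeholder before text leaves the machine."""
    # Mask key-shaped tokens wherever they appear.
    text = TOKEN_PATTERN.sub(PLACEHOLDER, text)
    # Mask the value side of secret-named assignments, keeping the name and separator.
    text = ASSIGNMENT_PATTERN.sub(
        lambda m: f"{m.group(1)}{m.group(2)}{PLACEHOLDER}", text
    )
    return text
```

Pattern-based redaction of this kind is best effort: it only catches secrets that match a known shape or naming convention, so it works alongside the file allowlist mentioned in the finding and does not replace it.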
Audit Metadata
  • Risk Level: SAFE
  • Analyzed: Jun 18, 2026, 09:38 PM