agent-evals
Pass
Audited by Gen Agent Trust Hub on Jun 18, 2026
Risk Level: SAFE
COMMAND_EXECUTION, EXTERNAL_DOWNLOADS, DATA_EXFILTRATION
Full Analysis
- [COMMAND_EXECUTION]: The evaluation scripts (`evals/phase2-grader.py` and `evals/integration-test.sh`) utilize shell commands to manage virtual environments and install dependencies required for the skill's test suite.
- [COMMAND_EXECUTION]: The `evals/phase2-grader.py` script uses `os.execv` to restart its execution within a newly created virtual environment after verifying the availability of the Anthropic SDK.
- [COMMAND_EXECUTION]: The evaluation fixture `evals/fixtures/agent.py` contains a `subprocess.run` call with `shell=True`. This file is explicitly documented as a mock workspace component used to test the skill's ability to diagnose and remediate un-sandboxed execution risks in user code.
- [COMMAND_EXECUTION]: The `autonomous-improve-loop.mjs` template executes shell commands for running user-defined evaluations and benchmarking suites via `spawnSync`. These commands are configured through environment variables to provide flexibility in various CI/CD environments.
- [COMMAND_EXECUTION]: The `level-3-sandbox-harness.py` template utilizes `docker run` to execute user-defined shell commands within an isolated container, implementing a recommended security control for AI agents with system access.
- [EXTERNAL_DOWNLOADS]: The skill's test automation suite downloads and installs the `anthropic` and `dspy-ai` packages from official registries to facilitate its internal evaluation and integration testing.
- [DATA_EXFILTRATION]: The `autonomous-improve-loop.mjs` template transmits allowlisted workspace file contents and execution traces to the OpenAI API for the purpose of generating improvement patches. The script includes a redaction mechanism designed to filter out API keys and secrets before data transmission.
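
The allowlist-and-redact step described in the DATA_EXFILTRATION finding can be illustrated with a minimal Python sketch. The audit does not show the template's actual rules, so everything named here is an assumption made for illustration: the `ALLOWLIST` globs, the two regular expressions, the `[REDACTED]` placeholder, and the `redact` and `build_payload` helpers are hypothetical and are not taken from `autonomous-improve-loop.mjs`.

```python
import re
from fnmatch import fnmatch

# Hypothetical allowlist globs; the real template's list is not shown in the audit.
ALLOWLIST = ("src/*.py", "evals/*.json")

# Assumed pattern: long tokens with an "sk-" prefix, a common API-key shape.
KEY_TOKEN = re.compile(r"\bsk-[A-Za-z0-9_-]{16,}")

# Assumed pattern: NAME=value or NAME: value where NAME looks secret-bearing.
ASSIGNMENT = re.compile(
    r"(?i)\b([A-Z0-9_]*(?:API_KEY|SECRET|TOKEN|PASSWORD)[A-Z0-9_]*)"
    r"(\s*[:=]\s*)(\S+)"
)


def redact(text: str) -> str:
    """Replace secret-looking values with a placeholder, keeping the names."""
    text = ASSIGNMENT.sub(lambda m: f"{m.group(1)}{m.group(2)}[REDACTED]", text)
    return KEY_TOKEN.sub("[REDACTED]", text)


def build_payload(files: dict[str, str]) -> dict[str, str]:
    """Keep only allowlisted paths and redact their contents before sending."""
    return {
        path: redact(body)
        for path, body in files.items()
        if any(fnmatch(path, pattern) for pattern in ALLOWLIST)
    }


if __name__ == "__main__":
    workspace = {
        "src/agent.py": "API_KEY = 'abc'",
        ".env": "SECRET=1",
    }
    # Only the allowlisted file survives, with its secret value masked.
    print(build_payload(workspace))
```

Pattern-based redaction of this kind is best-effort: it can miss secrets that match no pattern and can mask harmless values whose names happen to contain words such as `token`. For that reason the allowlist does most of the work in this sketch, since a file that is never selected cannot leak.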
Audit Metadata