experiment-loop
Warn
Audited by Gen Agent Trust Hub on Jun 22, 2026
Risk Level: MEDIUM (COMMAND_EXECUTION, REMOTE_CODE_EXECUTION, PROMPT_INJECTION)
Full Analysis
- [COMMAND_EXECUTION]: The skill executes arbitrary shell commands provided in the `measurement_cmd` field of experiment definitions.
  - Evidence: The documentation describes `measurement_cmd` as a "Shell command that produces JSON with the metric value" and provides examples like `npm run bench:api` and `python eval/run_evals.py`.
- [REMOTE_CODE_EXECUTION]: The skill automates a process where one agent (`spark`) modifies the codebase and a subsequent step executes the modified code via benchmarks or tests.
  - Evidence: The "5-Step Loop" explicitly includes a "MODIFY" phase followed by a "TEST" phase where measurements are run.
- [PROMPT_INJECTION]: The skill is susceptible to indirect prompt injection if the configuration file (`thoughts/EXPERIMENTS.md`) is poisoned with malicious commands.
  - Ingestion points: The skill reads experiment definitions from `thoughts/EXPERIMENTS.md` or the user's task description.
  - Boundary markers: Absent. There are no delimiters or warnings to prevent the agent from executing malicious instructions embedded in the `measurement_cmd` field.
  - Capability inventory: The skill can execute shell commands, modify files within a defined `scope`, and perform git operations (`git stash`).
  - Sanitization: Absent. The skill does not validate or sanitize the shell commands before execution.
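
To make the missing sanitization step concrete, the following is a minimal sketch of what a pre-execution check on `measurement_cmd` might look like. It is not part of the audited skill. The function name `validate_measurement_cmd`, the `ALLOWED_PREFIXES` list, and the metacharacter set are all hypothetical, chosen for illustration from the two example commands quoted in the evidence above.

```python
import shlex

# Hypothetical allowlist (an assumption, not taken from the skill):
# each entry is the leading argv tokens a measurement command must start with.
ALLOWED_PREFIXES = [
    ["npm", "run"],
    ["python", "eval/run_evals.py"],
]

# Characters that let a single string chain, redirect, or substitute commands
# when handed to a shell. Rejecting them is a conservative, illustrative choice.
SHELL_METACHARACTERS = set(";&|<>`$(){}\n")


def validate_measurement_cmd(cmd: str) -> list[str]:
    """Return the command as an argv list if it passes the checks.

    Raises ValueError if the command contains shell metacharacters,
    is empty, or does not start with an allowlisted prefix.
    """
    if any(ch in SHELL_METACHARACTERS for ch in cmd):
        raise ValueError("shell metacharacters are not allowed")

    # shlex.split tokenizes like a POSIX shell without executing anything;
    # it raises ValueError itself on unbalanced quotes.
    argv = shlex.split(cmd)
    if not argv:
        raise ValueError("empty command")

    for prefix in ALLOWED_PREFIXES:
        if argv[: len(prefix)] == prefix:
            return argv

    raise ValueError(f"command not in allowlist: {argv[0]}")
```

The returned list could then be passed to `subprocess.run(argv, shell=False)` so that no shell interprets the string. A check like this would only narrow the COMMAND_EXECUTION and PROMPT_INJECTION findings: an allowlisted `npm run` or `python` invocation still executes whatever code the MODIFY phase wrote, so the REMOTE_CODE_EXECUTION finding is unaffected.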
Audit Metadata