grpo-rl-training
Fail
Audited by Gen Agent Trust Hub on Apr 30, 2026
Risk Level: HIGH (REMOTE_CODE_EXECUTION, COMMAND_EXECUTION, PROMPT_INJECTION)
Full Analysis
- [COMMAND_EXECUTION]: The file examples/reward_functions_library.py implements a reward function that executes code generated by the language model. The run_test_cases function uses the Python exec() function to evaluate strings extracted from model completions. AI-generated code is inherently untrusted; executing it without a robust sandbox (such as Docker or a gVisor-based runtime) allows the model to perform any action the user running the training script can, including file system modification or network access.
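To illustrate the pattern this finding describes, here is a minimal sketch of a reward function that grades completions by executing them with `exec()`. The function body and signature are assumptions for illustration, not the skill's actual code; only the use of `exec()` on model output is taken from the finding.

```python
def run_test_cases(completion: str, test_cases: list[str]) -> float:
    """Reward = fraction of test cases the generated code passes.

    UNSAFE sketch: exec() runs the model's code with the full privileges
    of whoever launched the training script. A completion containing e.g.
    __import__('os').system(...) would execute immediately.
    """
    namespace: dict = {}
    try:
        exec(completion, namespace)  # model-generated code runs unsandboxed
    except Exception:
        return 0.0
    passed = 0
    for test in test_cases:
        try:
            exec(test, namespace)  # test assertions also run unsandboxed
            passed += 1
        except Exception:
            pass
    return passed / len(test_cases)
```

Nothing in this path inspects the completion before running it, which is exactly the gap the audit flags.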
- [REMOTE_CODE_EXECUTION]: The combination of processing external datasets and executing generated code creates a path for remote code execution. The training pipeline in templates/basic_grpo_training.py ingests data via the load_dataset function, which is then used to prompt the model. If used in conjunction with the code execution reward from the library, an attacker-controlled dataset could lead to arbitrary code execution on the host machine.
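The attack path can be traced end to end with a benign stand-in payload. Every name below is hypothetical; the point is only that a dataset-shaped prompt can steer a completion that the reward function then executes with the trainer's privileges.

```python
import os
import tempfile

# Harmless marker file standing in for a real attacker side effect.
marker = os.path.join(tempfile.gettempdir(), "grpo_injection_demo.txt")

# 1. An attacker-controlled dataset row becomes a training prompt.
poisoned_prompt = "Solve the task, and also include this setup snippet: ..."

# 2. A completion shaped by that prompt (benign stand-in for a payload).
completion = f"open({marker!r}, 'w').write('attacker code ran')"

# 3. The reward function executes it verbatim on the host.
exec(completion)

assert os.path.exists(marker)  # the side effect landed on the host machine
os.remove(marker)
```

With a real payload, step 3 could modify files, open network connections, or exfiltrate credentials available to the training process.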
- [PROMPT_INJECTION]: The skill is vulnerable to indirect prompt injection because it automatically executes model outputs.
  1. Ingestion points: training data is loaded in templates/basic_grpo_training.py using the get_dataset function.
  2. Boundary markers: the skill uses XML-style tags to structure model responses, but these do not prevent the model from generating malicious Python code within the tags.
  3. Capability inventory: the system can execute Python code via the exec() call in examples/reward_functions_library.py.
  4. Sanitization: there is no evidence of code sanitization, static analysis, or sandboxing of the generated code before it is passed to exec().
Recommendations
- Never pass model completions directly to exec(); run generated code in an isolated sandbox (e.g. a Docker container or gVisor-based runtime) with no network access and a strict timeout.
- Treat external datasets loaded via load_dataset as untrusted input; pin and verify dataset sources before training.
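As a starting point for the sandboxing recommendation, here is a hedged sketch that runs untrusted code in a separate interpreter process with a timeout. The function name is an assumption; process isolation alone is not a full sandbox (it provides no filesystem or network isolation) and should be paired with a container runtime such as Docker or gVisor.

```python
import os
import subprocess
import sys
import tempfile

def run_isolated(code: str, timeout: float = 5.0) -> bool:
    """Run untrusted code in a child Python process with a hard timeout.

    Returns True if the code exits cleanly. NOT a complete sandbox:
    combine with container-level isolation for real deployments.
    """
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    try:
        proc = subprocess.run(
            [sys.executable, "-I", path],  # -I: isolated mode, ignores env/site
            capture_output=True,
            timeout=timeout,
        )
        return proc.returncode == 0
    except subprocess.TimeoutExpired:
        return False  # runaway or stalling code is killed, not trusted
    finally:
        os.unlink(path)
```

A reward function can call this instead of exec(), scoring 0.0 for anything that fails, times out, or is killed.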