evaluating-llms-harness
Warn
Audited by Gen Agent Trust Hub on Apr 30, 2026
Risk Level: MEDIUM (COMMAND_EXECUTION, REMOTE_CODE_EXECUTION, EXTERNAL_DOWNLOADS)
Full Analysis
- [COMMAND_EXECUTION]: The skill instructions include Python code snippets that use `os.system()` to invoke shell commands and external scripts during the training and evaluation loops (e.g., in `SKILL.md` workflows).
- [REMOTE_CODE_EXECUTION]: The `references/benchmark-guide.md` file explicitly instructs the agent to use the `--allow_code_execution` flag when running the `HumanEval` task. This flag enables the execution of code generated by the LLM, which is a significant security risk if the model is untrusted.
- [REMOTE_CODE_EXECUTION]: The `references/custom-tasks.md` guide documents the use of the `!function` YAML tag to dynamically load and execute Python functions from a local `utils.py` file, creating a vector for executing arbitrary logic.
- [COMMAND_EXECUTION]: The skill provides and encourages the use of shell scripts (e.g., `eval_checkpoint.sh`, `eval_all_models.sh`) to automate model benchmarking, which involves direct shell interaction.
- [EXTERNAL_DOWNLOADS]: The skill documentation facilitates the download of various models and datasets from Hugging Face and other remote repositories as part of the standard evaluation process.
- [CREDENTIALS_UNSAFE]: In `references/api-evaluation.md`, the documentation provides an example of disabling SSL verification using `verify_certificate=false` for development purposes, which could expose the agent to man-in-the-middle attacks when communicating with API endpoints.
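To illustrate why the `os.system()` finding matters, here is a minimal sketch of the usual mitigation: passing an argument list to `subprocess.run` so that no shell parses the command. The `run_command` helper and the `untrusted` value are hypothetical and are not taken from the audited skill. The Python interpreter stands in for an evaluation script.

```python
import subprocess
import sys


def run_command(args, timeout=60):
    """Run a command given as an argument list, without invoking a shell."""
    # shell=False is the default: each list element reaches the program
    # verbatim, so characters such as ';' or '&&' are never interpreted.
    result = subprocess.run(
        args,
        capture_output=True,
        text=True,
        timeout=timeout,
        check=False,
    )
    return result.returncode, result.stdout


# Hypothetical untrusted value, e.g. a model name taken from user input.
untrusted = "model; echo INJECTED"

# The child process simply echoes back the single argument it received.
code, out = run_command(
    [sys.executable, "-c", "import sys; print(sys.argv[1])", untrusted]
)
```

Because the value arrives as a single argument, the embedded `; echo INJECTED` is treated as data. The same string interpolated into an `os.system()` call would be parsed by the shell as a second command.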
Audit Metadata