evaluating-llms-harness

Warn

Audited by Gen Agent Trust Hub on Apr 30, 2026

Risk Level: MEDIUM (COMMAND_EXECUTION, REMOTE_CODE_EXECUTION, EXTERNAL_DOWNLOADS)
Full Analysis
  • [COMMAND_EXECUTION]: The skill instructions include Python code snippets that use os.system() to invoke shell commands and external scripts during the training and evaluation loops (e.g., in SKILL.md workflows).
  • [REMOTE_CODE_EXECUTION]: The references/benchmark-guide.md file explicitly instructs the agent to use the --allow_code_execution flag when running the HumanEval task. This flag enables the execution of code generated by the LLM, which is a significant security risk if the model is untrusted.
  • [REMOTE_CODE_EXECUTION]: The references/custom-tasks.md guide documents the use of the !function YAML tag to dynamically load and execute Python functions from a local utils.py file, creating a vector for executing arbitrary logic.
  • [COMMAND_EXECUTION]: The skill provides and encourages the use of shell scripts (e.g., eval_checkpoint.sh, eval_all_models.sh) to automate model benchmarking, which involves direct shell interaction.
  • [EXTERNAL_DOWNLOADS]: The skill documentation facilitates the download of various models and datasets from Hugging Face and other remote repositories as part of the standard evaluation process.
  • [CREDENTIALS_UNSAFE]: In references/api-evaluation.md, the documentation provides an example of disabling SSL verification using verify_certificate=false for development purposes, which could expose the agent to man-in-the-middle attacks when communicating with API endpoints.
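
The COMMAND_EXECUTION finding on os.system() can be illustrated with a minimal sketch. It is not taken from the skill itself: the run_command helper and the untrusted value are hypothetical. It shows one common mitigation, assuming the commands can be expressed as argument lists: calling subprocess.run without a shell, so that shell metacharacters in a value are never interpreted.

```python
import shlex
import subprocess
import sys


def run_command(args: list[str]) -> subprocess.CompletedProcess:
    """Run a command without a shell, so arguments are never re-parsed."""
    return subprocess.run(args, capture_output=True, text=True, check=True)


# Hypothetical attacker-influenced value, e.g. a checkpoint name read from a config.
untrusted = "model; echo INJECTED"

# With os.system(f"... {untrusted}") the shell would treat ';' as a command separator.
# Passed as a list element, the whole string stays a single argument.
result = run_command(
    [sys.executable, "-c", "import sys; print(sys.argv[1])", untrusted]
)
echoed = result.stdout.strip()
print(echoed)

# If a shell string is unavoidable, shlex.quote keeps the value as one token.
quoted = shlex.quote(untrusted)
```

Here the child process receives the full string as one argument, so the text after the semicolon is never run as a second command. This narrows the shell-injection surface only; it does not address the separate risk of executing model-generated code.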
Audit Metadata
  Risk Level: MEDIUM
  Analyzed: Apr 30, 2026, 03:34 PM