The Agent Skills Directory

[COMMAND_EXECUTION]: The skill instructions include Python code snippets that use os.system() to invoke shell commands and external scripts during the training and evaluation loops (e.g., in SKILL.md workflows).
[REMOTE_CODE_EXECUTION]: The references/benchmark-guide.md file explicitly instructs the agent to pass the --allow_code_execution flag when running the HumanEval task. This flag allows code generated by the LLM to be executed on the host, which is a significant security risk if the model is untrusted or the run is not sandboxed.
[REMOTE_CODE_EXECUTION]: The references/custom-tasks.md guide documents the !function YAML tag, which dynamically loads and executes Python functions from a local utils.py file. This makes the task configuration a vector for executing arbitrary Python code.
[COMMAND_EXECUTION]: The skill provides shell scripts (e.g., eval_checkpoint.sh, eval_all_models.sh) and encourages their use to automate model benchmarking, which requires the agent to run shell commands directly.
[EXTERNAL_DOWNLOADS]: The skill documentation has the agent download models and datasets from Hugging Face and other remote repositories as part of the standard evaluation process.
[CREDENTIALS_UNSAFE]: The references/api-evaluation.md file includes an example that disables SSL certificate verification with verify_certificate=false for development purposes. If this setting is carried into real use, the agent's traffic to API endpoints is exposed to man-in-the-middle attacks.
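
To make the COMMAND_EXECUTION findings concrete, the following minimal sketch contrasts a shell command string, as os.system() would receive it, with an argument list passed to subprocess.run(). It is not taken from the skill. The function names echo_argument and quoted_for_shell and the sample checkpoint name are hypothetical and used for illustration only.

```python
import shlex
import subprocess
import sys


def echo_argument(value: str) -> str:
    """Pass `value` to a child Python process as ONE argument and return what it received.

    The command is given as a list, so no shell parses it: characters such as
    ';' or '&&' inside `value` are delivered literally instead of being
    interpreted as shell syntax.
    """
    argv = [sys.executable, "-c", "import sys; print(sys.argv[1])", value]
    result = subprocess.run(argv, capture_output=True, text=True, check=True)
    return result.stdout.rstrip("\n")


def quoted_for_shell(value: str) -> str:
    """Quote `value` for the rare case where a POSIX shell string is unavoidable."""
    return shlex.quote(value)


if __name__ == "__main__":
    # Hypothetical checkpoint name carrying an injected shell command.
    suspicious = "ckpt-1; echo injected"
    # With os.system(f"... {suspicious}") the shell would run `echo injected`.
    # With an argument list, the child sees the whole string as one value.
    print(echo_argument(suspicious))
    print(quoted_for_shell(suspicious))
```

Passing an argument list avoids shell parsing entirely, which is usually simpler to reason about than quoting each interpolated value. It does not make the invoked script itself trustworthy.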

evaluating-llms-harness