agent-evaluation

Warn

Audited by Gen Agent Trust Hub on Apr 23, 2026

Risk Level: MEDIUMCOMMAND_EXECUTIONPROMPT_INJECTION
Full Analysis
  • [COMMAND_EXECUTION]: Several utility scripts execute shell commands via subprocess.run to interact with system tools and cloud CLIs. \n
  • scripts/setup_mlflow.py and scripts/utils/env_validation.py run the databricks auth profiles command. \n
  • scripts/validate_environment.py executes the mlflow doctor diagnostic tool. \n- [COMMAND_EXECUTION]: The skill performs dynamic code execution and script generation. \n
  • scripts/create_dataset_template.py and scripts/run_evaluation_template.py use subprocess.run(['python', '-c', ...]) to run dynamically constructed Python snippets for metadata retrieval. \n
  • The skill generates new executable Python files (create_evaluation_dataset.py and run_agent_evaluation.py) from internal templates. \n
  • scripts/validate_tracing_runtime.py uses importlib.import_module to dynamically load agent code based on user-provided module names. \n- [COMMAND_EXECUTION]: The skill modifies file permissions on dynamically created scripts. \n
  • scripts/create_dataset_template.py and scripts/run_evaluation_template.py call os.chmod to grant execution privileges (0o755) to the generated scripts. \n- [PROMPT_INJECTION]: The skill is susceptible to indirect prompt injection because it processes external data for agent evaluation without adequate safeguards. \n
  • Ingestion points: Untrusted data from MLflow datasets is loaded into DataFrames and passed directly to agent entry points in the run_agent_evaluation.py script. \n
  • Boundary markers: There are no boundary markers or instructions used to prevent the agent from executing instructions embedded within the evaluation dataset. \n
  • Capability inventory: The evaluation environment allows for subprocess execution, file system access, and network communication via MLflow and LLM provider APIs. \n
  • Sanitization: The skill lacks mechanisms to sanitize or validate the content of the evaluation datasets before they are processed.
Audit Metadata
Risk Level
MEDIUM
Analyzed
Apr 23, 2026, 06:03 AM