improvement-evaluator

Pass

Audited by Gen Agent Trust Hub on Apr 8, 2026

Risk Level: SAFE · COMMAND_EXECUTION · PROMPT_INJECTION
Full Analysis
  • [COMMAND_EXECUTION]: The skill uses the subprocess.run function in scripts/evaluate.py, scripts/task_runner.py, and interfaces/judges.py to execute external commands. This includes calling the claude CLI for task execution and the pytest framework for output verification. These operations are core to the skill's purpose as a benchmarking utility.
  • [PROMPT_INJECTION]: The skill possesses an indirect prompt injection surface because it ingests untrusted data from YAML-formatted task suites to construct LLM prompts.
    • Ingestion points: Prompt text and evaluation rubrics are loaded from task suite files in the task_suites/ directory.
    • Boundary markers: In scripts/task_runner.py, skill content is wrapped in ---BEGIN SKILL.MD--- and ---END SKILL.MD--- markers before the task prompt is appended.
    • Capability inventory: The skill can execute shell commands and file system operations through its task runner and judge interfaces.
    • Sanitization: The PytestJudge in interfaces/judges.py performs path traversal checks using Path.resolve() to confine test execution to the tests/fixtures/ directory, mitigating the risk of executing arbitrary system files.
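The two mitigations named above can be illustrated with a minimal sketch. This is not the skill's actual code, which the audit does not reproduce: the function names build_prompt and is_within are hypothetical, and only the marker strings, Path.resolve(), and the tests/fixtures/ directory come from the findings.

```python
from pathlib import Path

# Marker strings as quoted in the audit finding.
BEGIN_MARKER = "---BEGIN SKILL.MD---"
END_MARKER = "---END SKILL.MD---"


def build_prompt(skill_text: str, task_prompt: str) -> str:
    """Hypothetical helper: delimit skill content, then append the task prompt."""
    return f"{BEGIN_MARKER}\n{skill_text}\n{END_MARKER}\n\n{task_prompt}"


def is_within(candidate: str, allowed_root: str = "tests/fixtures") -> bool:
    """Hypothetical helper: True only if candidate resolves inside allowed_root.

    Path.resolve() collapses ".." segments and follows symlinks, so the
    comparison is made on the real target rather than the raw string.
    """
    root = Path(allowed_root).resolve()
    target = (root / candidate).resolve()
    return target == root or root in target.parents


if __name__ == "__main__":
    print(build_prompt("skill body", "Run the task."))
    print(is_within("test_example.py"))      # inside the fixtures directory
    print(is_within("../../etc/passwd"))     # escapes via ".." segments
```

Comparing resolved paths rather than raw strings is what lets a check of this kind reject inputs such as ../../etc/passwd or an absolute path, both of which would pass a naive prefix test on the unresolved string.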
Audit Metadata
Risk Level: SAFE
Analyzed: Apr 8, 2026, 03:25 AM