improvement-evaluator
Pass
Audited by Gen Agent Trust Hub on Apr 8, 2026
Risk Level: SAFE
COMMAND_EXECUTION, PROMPT_INJECTION
Full Analysis
- [COMMAND_EXECUTION]: The skill uses the `subprocess.run` function in `scripts/evaluate.py`, `scripts/task_runner.py`, and `interfaces/judges.py` to execute external commands. This includes calling the `claude` CLI for task execution and the `pytest` framework for output verification. These operations are core to the skill's purpose as a benchmarking utility.
- [PROMPT_INJECTION]: The skill possesses an indirect prompt injection surface because it ingests untrusted data from YAML-formatted task suites to construct LLM prompts.
  - Ingestion points: Prompt text and evaluation rubrics are loaded from task suite files located in the `task_suites/` directory.
  - Boundary markers: In `scripts/task_runner.py`, skill content is delimited with `---BEGIN SKILL.MD---` and `---END SKILL.MD---` markers before the task prompt is appended.
  - Capability inventory: The skill has the capability to execute shell commands and file system operations via its task runner and judge interfaces.
  - Sanitization: The `PytestJudge` in `interfaces/judges.py` implements path traversal checks using `Path.resolve()` to ensure that test execution is strictly confined to the `tests/fixtures/` directory, mitigating the risk of executing arbitrary system files.
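
The boundary markers and the path confinement check described above can be illustrated with a minimal Python sketch. This is not the skill's actual code: the function names `build_prompt` and `is_confined`, the exact prompt layout, and the default `allowed_root` argument are assumptions made for illustration. Only the marker strings, the use of `Path.resolve()`, and the `tests/fixtures/` directory come from the audit text.

```python
from pathlib import Path

# Marker strings as quoted in the audit; everything else here is hypothetical.
BEGIN_MARKER = "---BEGIN SKILL.MD---"
END_MARKER = "---END SKILL.MD---"


def build_prompt(skill_content: str, task_prompt: str) -> str:
    """Delimit the skill content with markers, then append the task prompt.

    Hypothetical helper: the real layout in scripts/task_runner.py may differ.
    """
    return f"{BEGIN_MARKER}\n{skill_content}\n{END_MARKER}\n\n{task_prompt}"


def is_confined(candidate: str, allowed_root: str = "tests/fixtures") -> bool:
    """Return True only if `candidate` resolves to a location inside `allowed_root`.

    Hypothetical helper: resolve() collapses '..' segments, so a traversal
    attempt ends up outside the root and is rejected. An absolute `candidate`
    replaces the root when joined, so it is rejected unless it already lies
    inside the root.
    """
    root = Path(allowed_root).resolve()
    target = (root / candidate).resolve()
    return target == root or root in target.parents
```

A judge built along these lines would call `is_confined` on every test path taken from a task suite before handing it to `pytest`, and refuse to run anything that returns `False`. Note that delimiting untrusted content with markers makes the boundary visible to the model but does not by itself prevent injected instructions from being followed.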
Audit Metadata