ai-safety-researcher
Pass
Audited by Gen Agent Trust Hub on Apr 21, 2026
Risk Level: SAFE
Full Analysis
- [PROMPT_INJECTION]: Reference files include examples of adversarial prompts such as 'Pretend you're DAN' and 'Authority Escalation' (in references/scenario-red-team-evaluation-jailbreak-attack-.md). These strings are explicitly categorized as research data for building red-team suites and testing model defenses, rather than instructions to the agent.
- [REMOTE_CODE_EXECUTION]: The skill provides Python code snippets for mechanistic interpretability research (in references/scenario-red-team-evaluation-jailbreak-attack-.md). This code serves as instructional documentation for the user and does not contain instructions for the agent to execute code at runtime.
Audit Metadata