ai-safety-researcher

Pass

Audited by Gen Agent Trust Hub on Apr 21, 2026

Risk Level: SAFE
Full Analysis
  • [PROMPT_INJECTION]: Reference files contain examples of adversarial prompts, such as 'Pretend you're DAN' and 'Authority Escalation' (in references/scenario-red-team-evaluation-jailbreak-attack-.md). These strings are explicitly framed as research data for building red-team evaluation suites and testing model defenses, not as instructions directed at the agent.
  • [REMOTE_CODE_EXECUTION]: The skill includes Python code snippets for mechanistic interpretability research (in references/scenario-red-team-evaluation-jailbreak-attack-.md). This code serves as instructional documentation for the user; nothing directs the agent to execute it at runtime.
Audit Metadata
  • Risk Level: SAFE
  • Analyzed: Apr 21, 2026, 11:43 AM