ai-safety-researcher

Pass

Audited by Gen Agent Trust Hub on Apr 21, 2026

Risk Level: SAFE
Full Analysis
  • [PROMPT_INJECTION]: Reference files contain examples of adversarial prompts, such as 'Pretend you're DAN' and 'Authority Escalation' (in references/scenario-red-team-evaluation-jailbreak-attack-.md). These strings are explicitly framed as research data for building red-team evaluation suites and testing model defenses, not as instructions directed at the agent.
  • [REMOTE_CODE_EXECUTION]: The skill includes Python code snippets for mechanistic interpretability research (in references/scenario-red-team-evaluation-jailbreak-attack-.md). This code serves as instructional documentation for the user; nothing directs the agent to execute it at runtime.
Audit Metadata
  • Risk Level: SAFE
  • Analyzed: Apr 21, 2026, 11:43 AM