Detecting AI Model Prompt Injection Attacks
When to Use
- Scanning user inputs to LLM-powered applications before they are forwarded to the model
- Building an input validation layer for chatbots, AI agents, or retrieval-augmented generation (RAG) pipelines
- Monitoring logs of LLM interactions to retrospectively identify prompt injection attempts
- Evaluating the effectiveness of existing prompt injection defenses through red-team testing
- Classifying prompt injection payloads during security incident investigations involving AI systems
Do not use as the sole defense mechanism against prompt injection -- always combine with output validation, privilege separation, and least-privilege tool access. Not suitable for detecting jailbreaks that do not involve injection of adversarial instructions.
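The first use case, scanning input before it reaches the model, can be sketched as a small gate in front of the LLM call. This is a minimal illustration, not the skill's own implementation: `GateResult`, `guarded_call`, `score_fn`, `forward_fn`, and the 0.5 threshold are all hypothetical names and values chosen for the example. Any detector that returns an injection score between 0 and 1 could be plugged in as `score_fn`.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class GateResult:
    """Outcome of one gated request (hypothetical structure for this sketch)."""
    allowed: bool
    score: float
    response: Optional[str]


def guarded_call(
    user_input: str,
    score_fn: Callable[[str], float],   # assumed: returns an injection score in [0, 1]
    forward_fn: Callable[[str], str],   # assumed: sends the input to the LLM
    threshold: float = 0.5,             # illustrative value, tune on your own data
) -> GateResult:
    """Score the input first and forward it only when it falls below the threshold."""
    score = score_fn(user_input)
    if score >= threshold:
        # Blocked: the model is never called with this input.
        return GateResult(allowed=False, score=score, response=None)
    return GateResult(allowed=True, score=score, response=forward_fn(user_input))
```

Because the gate only decides whether to forward, it still needs to be paired with output validation and least-privilege tool access, as noted above.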
Prerequisites
- Python 3.10+ with pip for installing detection dependencies
- The transformers and torch libraries for running the DeBERTa-based classifier model
- The protectai/deberta-v3-base-prompt-injection-v2 model from Hugging Face (downloaded on first run, approximately 700 MB)
- Network access to Hugging Face Hub for initial model download (offline mode supported after first download)
- Sample prompt injection payloads for testing (the script includes a built-in test suite)
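With these prerequisites installed, loading the classifier might look like the sketch below. It is an illustration under stated assumptions rather than the script this skill ships: the helper names `load_classifier` and `is_injection` are hypothetical, the `INJECTION` label is assumed from the model's published configuration, and the 0.5 threshold is arbitrary. Passing `local_files_only=True` to `from_pretrained` is one way to stay offline once the model is cached.

```python
from typing import Callable, Dict, List

MODEL_ID = "protectai/deberta-v3-base-prompt-injection-v2"

# A classifier takes text and returns a list of {"label": ..., "score": ...} dicts.
Classifier = Callable[[str], List[Dict[str, object]]]


def load_classifier(offline: bool = False) -> Classifier:
    """Build a text-classification pipeline for the model (hypothetical helper)."""
    # Imported lazily so the module can be imported without transformers/torch.
    from transformers import (
        AutoModelForSequenceClassification,
        AutoTokenizer,
        pipeline,
    )

    # local_files_only=True skips the Hub and reads only from the local cache.
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, local_files_only=offline)
    model = AutoModelForSequenceClassification.from_pretrained(
        MODEL_ID, local_files_only=offline
    )
    return pipeline(
        "text-classification",
        model=model,
        tokenizer=tokenizer,
        truncation=True,
        max_length=512,
    )


def is_injection(text: str, classifier: Classifier, threshold: float = 0.5) -> bool:
    """Return True when the top label is INJECTION (assumed label name) with enough confidence."""
    top = classifier(text)[0]
    return top["label"] == "INJECTION" and float(top["score"]) >= threshold
```

A typical call would be `is_injection(user_text, load_classifier())`. Keeping the classifier as a parameter lets tests substitute a stub and avoids reloading the model for every input.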