ai-moderating-content

Pass

Audited by Gen Agent Trust Hub on May 13, 2026

Risk Level: SAFEPROMPT_INJECTIONEXTERNAL_DOWNLOADS
Full Analysis
  • [PROMPT_INJECTION]: The skill is designed to ingest and process untrusted user-generated content (UGC), which creates a surface for indirect prompt injection attacks where malicious users might attempt to bypass moderation logic with embedded instructions. \n
  • Ingestion points: Untrusted data enters the agent context through the content field in ModerateContent (SKILL.md), comment in ModerateComment (examples.md), and title/description in ModerateListing (examples.md).\n
  • Boundary markers: The provided code examples do not include explicit prompt delimiters or specific instructions to the model to ignore instructions embedded within the user content.\n
  • Capability inventory: The skill produces a decision (e.g., 'remove', 'approve', 'reject') that would typically trigger automated actions in a host system, potentially leading to unauthorized approval of malicious content if an injection succeeds.\n
  • Sanitization: It uses regex to catch specific PII patterns like SSNs and email addresses, but it does not sanitize against semantic prompt injection payloads.
  • [EXTERNAL_DOWNLOADS]: The documentation suggests installing a related skill using npx skills add lebsral/DSPy-Programming-not-prompting-LMs-skills --skill ai-do. This refers to a resource owned by the vendor lebsral.
Audit Metadata
Risk Level
SAFE
Analyzed
May 13, 2026, 06:45 PM