PII Detection in Unstructured Data
Overview
Unstructured data (emails, documents, images, chat logs, call transcripts, and system logs) is commonly estimated to make up around 80% of enterprise data, and it is typically the hardest category to bring under privacy compliance. Unlike structured databases, where personal data resides in known columns, unstructured data carries PII embedded in free text, attached files, scanned images, and metadata. This skill covers detection approaches based on Named Entity Recognition (NER), pattern matching, OCR, and hybrid pipelines that combine them, with a focus on Microsoft Presidio and spaCy as implementation frameworks.
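
To make the hybrid idea concrete, the following is a minimal sketch using only the Python standard library, not Presidio's or spaCy's actual API. It combines a pattern-matching layer (regexes plus simple validation) with an optional, pluggable NER callable. The names `Finding`, `PATTERNS`, `detect`, and the `ner` parameter are assumptions introduced for illustration, and the regexes are deliberately simplified. In a real deployment the `ner` callable could, for example, wrap spaCy entities (`ent.label_`, `ent.start_char`, `ent.end_char`).

```python
import re
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional, Tuple


@dataclass(frozen=True)
class Finding:
    entity_type: str
    start: int
    end: int
    text: str
    source: str  # "pattern" or "ner"


# Illustrative, simplified patterns (assumptions, not production-grade).
PATTERNS = {
    "EMAIL_ADDRESS": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
    "IP_ADDRESS": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
    "CREDIT_CARD": re.compile(r"\b\d(?:[ -]?\d){12,18}\b"),
}


def luhn_valid(number: str) -> bool:
    """Luhn checksum, used to discard digit runs that are not card numbers."""
    digits = [int(c) for c in number if c.isdigit()]
    total = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:  # double every second digit, counting from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return len(digits) >= 13 and total % 10 == 0


def detect_patterns(text: str) -> List[Finding]:
    """Pattern layer: regex candidates filtered by a validation step."""
    findings = []
    for entity_type, pattern in PATTERNS.items():
        for m in pattern.finditer(text):
            value = m.group()
            if entity_type == "CREDIT_CARD" and not luhn_valid(value):
                continue
            if entity_type == "IP_ADDRESS" and any(
                int(octet) > 255 for octet in value.split(".")
            ):
                continue
            findings.append(
                Finding(entity_type, m.start(), m.end(), value, "pattern")
            )
    return findings


# Hypothetical NER interface: yields (label, start_char, end_char) tuples.
NerFn = Callable[[str], Iterable[Tuple[str, int, int]]]


def detect(text: str, ner: Optional[NerFn] = None) -> List[Finding]:
    """Hybrid detection: pattern findings plus optional NER findings."""
    findings = detect_patterns(text)
    if ner is not None:
        for label, start, end in ner(text):
            findings.append(Finding(label, start, end, text[start:end], "ner"))
    # Overlap handling: the earliest start wins; on equal starts the longer
    # span wins. Any later finding overlapping a kept one is dropped.
    findings.sort(key=lambda f: (f.start, -(f.end - f.start)))
    merged: List[Finding] = []
    for f in findings:
        if merged and f.start < merged[-1].end:
            continue
        merged.append(f)
    return merged


if __name__ == "__main__":
    sample = "Contact jane.doe@example.com from 10.0.0.12."
    for finding in detect(sample):
        print(finding.entity_type, finding.text, finding.source)
```

The validation step (Luhn check, octet range) is what keeps the pattern layer from flagging arbitrary digit runs. The NER layer is left pluggable because names and other context-dependent entities cannot be captured reliably by regexes alone.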
Unstructured Data Sources at Vanguard Financial Services
| Source | Volume | PII Risk | Detection Challenge |
|---|---|---|---|
| Email (Exchange Online) | 2.1M messages/month | HIGH: names, account numbers, financial data in body and attachments | Mixed text and attachments; forwarded chains contain accumulated PII |
| SharePoint documents | 4.2 TB across 1,200 sites | HIGH: contracts, KYC docs, customer correspondence | Multiple formats (docx, pdf, xlsx); embedded images |
| Teams chat | 890K messages/month | MEDIUM: casual references to customers, internal discussions | Short messages, abbreviations, context-dependent PII |
| Application logs | 50 GB/day | MEDIUM: IP addresses, user IDs, error messages with PII | High volume, mixed with non-PII technical data |
| Scanned documents | 45K pages/month | HIGH: passport scans, signed contracts, medical certificates | Requires OCR; variable image quality |
| Call transcripts | 8K transcripts/month | HIGH: customers state names, account numbers, personal details | Speech-to-text errors, colloquial language |
| PDF reports | 12K documents/month | MEDIUM: financial reports may contain customer lists | Embedded tables, charts with PII labels |
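
Because each source in the table needs different preprocessing before detection can run (OCR for scans, text extraction for office formats, attachment unpacking for email), one way to organize a scanner is to route each file to a list of steps based on its type. The sketch below assumes routing by file suffix. The step names (`ocr`, `extract_text`, `unpack_attachments`, and so on) and the suffix groupings are hypothetical labels, not functions from Presidio, spaCy, or any other library.

```python
from pathlib import PurePath
from typing import List

# Hypothetical suffix groups, loosely following the sources in the table above.
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".tif", ".tiff"}  # scanned documents
OFFICE_SUFFIXES = {".docx", ".xlsx", ".pdf"}                 # SharePoint, PDF reports
MAIL_SUFFIXES = {".eml", ".msg"}                             # exported email
TEXT_SUFFIXES = {".txt", ".log", ".json"}                    # logs, chat, transcripts


def plan_extraction(filename: str) -> List[str]:
    """Return the ordered (hypothetical) processing steps for one file."""
    suffix = PurePath(filename).suffix.lower()
    if suffix in IMAGE_SUFFIXES:
        # Images carry no text layer, so OCR must run before detection.
        return ["ocr", "detect_pii"]
    if suffix in OFFICE_SUFFIXES:
        # Office and PDF files may embed images alongside extractable text.
        return ["extract_text", "ocr_embedded_images", "detect_pii"]
    if suffix in MAIL_SUFFIXES:
        # Each unpacked attachment would be planned again by its own suffix.
        return ["parse_body", "unpack_attachments", "detect_pii"]
    if suffix in TEXT_SUFFIXES:
        return ["detect_pii"]
    # Unknown formats are flagged for review rather than silently skipped.
    return ["manual_review"]


if __name__ == "__main__":
    for name in ["passport_scan.TIFF", "kyc_form.pdf", "thread.eml", "app.log"]:
        print(name, plan_extraction(name))
```

Sending unrecognized formats to `manual_review` is a design choice in this sketch: in a compliance setting, a file that cannot be scanned is better treated as unverified than as clean.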
Detection Architecture