pii-in-unstructured

PII Detection in Unstructured Data

Overview

Unstructured data — emails, documents, images, chat logs, call transcripts, and system logs — accounts for an estimated 80% of enterprise data and presents the greatest challenge for privacy compliance. Unlike structured databases where personal data resides in known columns, unstructured data contains PII embedded in free text, attached files, scanned images, and metadata. This skill covers detection approaches using Named Entity Recognition (NER), pattern matching, OCR, and hybrid pipelines, with focus on Microsoft Presidio and spaCy as implementation frameworks.
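As a rough illustration of the pattern-matching layer mentioned above, the sketch below scans free text for two entity types using only the Python standard library. The names `Finding`, `PATTERNS`, `find_pii`, and `redact` are hypothetical and are not part of Presidio or spaCy. The regular expressions are deliberately simple assumptions. In a hybrid pipeline, an NER model would typically supply names and other context-dependent entities that patterns like these cannot catch.

```python
import ipaddress
import re
from typing import NamedTuple


class Finding(NamedTuple):
    """One detected PII span (hypothetical structure for this sketch)."""
    entity_type: str
    start: int
    end: int
    text: str


# Illustrative patterns only (assumption): real deployments need broader,
# locale-aware rules and additional entity types.
PATTERNS = {
    "EMAIL_ADDRESS": re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b"),
    "IP_ADDRESS": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
}


def _is_valid_ipv4(candidate: str) -> bool:
    """Reject look-alikes such as version strings with octets above 255."""
    try:
        ipaddress.IPv4Address(candidate)
        return True
    except ValueError:
        return False


def find_pii(text: str) -> list[Finding]:
    """Return pattern matches sorted by their position in the text."""
    findings = []
    for entity_type, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            # Pattern match alone is not enough for IPs; validate the value.
            if entity_type == "IP_ADDRESS" and not _is_valid_ipv4(match.group()):
                continue
            findings.append(
                Finding(entity_type, match.start(), match.end(), match.group())
            )
    return sorted(findings, key=lambda f: f.start)


def redact(text: str) -> str:
    """Replace each finding with a <TYPE> placeholder.

    Works right to left so earlier offsets stay valid; assumes the
    findings do not overlap, which holds for the two patterns above.
    """
    out = text
    for f in reversed(find_pii(text)):
        out = out[:f.start] + f"<{f.entity_type}>" + out[f.end:]
    return out


if __name__ == "__main__":
    line = "login failed for jane.doe@example.com from 10.0.0.5 (build 1.2.3.400)"
    print(redact(line))
```

The validation step shows why pattern matching is usually paired with a second check: the build string `1.2.3.400` matches the IP pattern but is discarded because it is not a valid IPv4 address.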

Unstructured Data Sources at Vanguard Financial Services

| Source | Volume | PII Risk | Detection Challenge |
|---|---|---|---|
| Email (Exchange Online) | 2.1M messages/month | HIGH: names, account numbers, financial data in body and attachments | Mixed text and attachments; forwarded chains contain accumulated PII |
| SharePoint documents | 4.2TB across 1,200 sites | HIGH: contracts, KYC docs, customer correspondence | Multiple formats (docx, pdf, xlsx); embedded images |
| Teams chat | 890K messages/month | MEDIUM: casual references to customers, internal discussions | Short messages, abbreviations, context-dependent PII |
| Application logs | 50GB/day | MEDIUM: IP addresses, user IDs, error messages with PII | High volume, mixed with non-PII technical data |
| Scanned documents | 45K pages/month | HIGH: passport scans, signed contracts, medical certificates | Requires OCR; variable image quality |
| Call transcripts | 8K transcripts/month | HIGH: customers state names, account numbers, personal details | Speech-to-text errors, colloquial language |
| PDF reports | 12K documents/month | MEDIUM: financial reports may contain customer lists | Embedded tables, charts with PII labels |

Detection Architecture
