PII Detection in Unstructured Data

Overview

Unstructured data (emails, documents, images, chat logs, call transcripts, and system logs) is commonly estimated to make up about 80% of enterprise data, and it is the hardest category to bring under privacy compliance. In a structured database, personal data sits in known columns; in unstructured data, PII is embedded in free text, attached files, scanned images, and metadata, so it has to be found before it can be protected. This skill covers detection approaches based on Named Entity Recognition (NER), pattern matching, OCR, and hybrid pipelines, with a focus on Microsoft Presidio and spaCy as implementation frameworks.
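To make the pattern-matching part of that list concrete, here is a minimal sketch using only the Python standard library. It is not Presidio's or spaCy's API: the `Finding` class, the `find_pii` function, and both regular expressions are hypothetical names and simplifications chosen for illustration, and the entity labels are only assumed to resemble the ones a framework would emit. It pairs a regex with a checksum validator (Luhn) so that digit runs that merely look like card numbers are dropped.

```python
import re
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Finding:
    """One detected span (hypothetical structure, for illustration only)."""
    entity_type: str
    start: int
    end: int
    text: str


# Simplified patterns; real-world email and card formats are more varied.
EMAIL_RE = re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b")
# 13 to 19 digits, optionally separated by single spaces or hyphens.
CARD_RE = re.compile(r"\b\d(?:[ -]?\d){12,18}\b")


def luhn_valid(digits: str) -> bool:
    """Return True if the digit string passes the Luhn checksum."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:  # double every second digit, counting from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def find_pii(text: str) -> List[Finding]:
    """Scan free text for emails and Luhn-valid card numbers."""
    findings = []
    for m in EMAIL_RE.finditer(text):
        findings.append(Finding("EMAIL_ADDRESS", m.start(), m.end(), m.group()))
    for m in CARD_RE.finditer(text):
        digits = re.sub(r"\D", "", m.group())
        if luhn_valid(digits):  # the validator removes look-alike digit runs
            findings.append(Finding("CREDIT_CARD", m.start(), m.end(), m.group()))
    return sorted(findings, key=lambda f: f.start)


if __name__ == "__main__":
    sample = "Contact jane.doe@example.com, card 4111 1111 1111 1111."
    for finding in find_pii(sample):
        print(finding)
```

Patterns like these only cover identifiers with a fixed shape. Names, addresses, and other context-dependent PII are the part that NER is meant to handle, which is why the overview lists hybrid pipelines.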

Unstructured Data Sources at Vanguard Financial Services

Source	Volume	PII Risk	Detection Challenge
Email (Exchange Online)	2.1M messages/month	HIGH: names, account numbers, financial data in body and attachments	Mixed text and attachments; forwarded chains contain accumulated PII
SharePoint documents	4.2 TB across 1,200 sites	HIGH: contracts, KYC docs, customer correspondence	Multiple formats (docx, pdf, xlsx); embedded images
Teams chat	890K messages/month	MEDIUM: casual references to customers, internal discussions	Short messages, abbreviations, context-dependent PII
Application logs	50 GB/day	MEDIUM: IP addresses, user IDs, error messages with PII	High volume, mixed with non-PII technical data
Scanned documents	45K pages/month	HIGH: passport scans, signed contracts, medical certificates	Requires OCR; variable image quality
Call transcripts	8K transcripts/month	HIGH: customers state names, account numbers, personal details	Speech-to-text errors, colloquial language
PDF reports	12K documents/month	MEDIUM: financial reports may contain customer lists	Embedded tables, charts with PII labels
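The application-log row combines high volume with PII mixed into technical data. One way to handle that, sketched below under stated assumptions, is to stream log lines through a generator and validate each regex candidate before redacting it. The function name `redact_ips`, the `<IP_ADDRESS>` placeholder, and the sample log lines are hypothetical; only the standard-library `re` and `ipaddress` modules are used.

```python
import ipaddress
import re
from typing import Iterable, Iterator

# Four dot-separated groups of 1 to 3 digits, not part of a longer dotted run.
IPV4_CANDIDATE = re.compile(r"(?<![\d.])(?:\d{1,3}\.){3}\d{1,3}(?![\d.])")


def redact_ips(lines: Iterable[str], placeholder: str = "<IP_ADDRESS>") -> Iterator[str]:
    """Lazily yield log lines with valid IPv4 addresses replaced.

    Lines are processed one at a time, so a large log file never has
    to be held in memory.
    """

    def _replace(match: "re.Match[str]") -> str:
        candidate = match.group()
        try:
            # Rejects out-of-range octets such as 999.300.1.1.
            ipaddress.IPv4Address(candidate)
        except ValueError:
            return candidate  # not a valid address: leave the text alone
        return placeholder

    for line in lines:
        yield IPV4_CANDIDATE.sub(_replace, line)


if __name__ == "__main__":
    sample_lines = [
        "login from 192.168.1.20 user=u123",
        "build 999.300.1.1 ok",
    ]
    for redacted in redact_ips(sample_lines):
        print(redacted)
```

A four-part version string such as 1.2.3.4 is also a valid IPv4 address, so this sketch would redact it. Telling the two apart needs surrounding context, which is one reason the challenge column describes log PII as mixed with non-PII technical data.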

Detection Architecture