ai-training-data-class
Sensitive Data Classification for AI/ML Training Datasets
Overview
AI and machine learning models trained on personal data raise distinct classification challenges. Training data may contain direct personal data, inferred special categories, proxy variables for protected characteristics, and data whose consent scope does not extend to model training. The EU AI Act (Regulation (EU) 2024/1689) imposes additional requirements for high-risk AI systems, including data governance obligations under Art. 10 that intersect with GDPR classification requirements. This skill provides a framework for classifying training data, detecting bias-relevant features, documenting data provenance, and verifying consent coverage.
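The four kinds of content named above can be recorded as labels on each feature of a dataset. The following is a minimal sketch, not a prescribed schema: `TrainingDataClass`, `FeatureRecord`, `features_needing_review`, and the example feature names are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from enum import Enum


class TrainingDataClass(Enum):
    """Hypothetical labels mirroring the four cases listed in the overview."""
    DIRECT_PERSONAL = "direct personal data"
    INFERRED_SPECIAL_CATEGORY = "inferred special category"
    PROXY_VARIABLE = "proxy for a protected characteristic"
    CONSENT_OUT_OF_SCOPE = "consent does not cover model training"


@dataclass
class FeatureRecord:
    """One feature (column) of a training dataset and its assigned labels."""
    name: str
    classes: set = field(default_factory=set)
    note: str = ""


def features_needing_review(features):
    """Return the names of features that carry at least one label."""
    return [f.name for f in features if f.classes]


# Illustrative features only; a real inventory would come from a data catalogue.
features = [
    FeatureRecord("email", {TrainingDataClass.DIRECT_PERSONAL}),
    FeatureRecord("postcode", {TrainingDataClass.PROXY_VARIABLE},
                  note="may correlate with a protected characteristic"),
    FeatureRecord("basket_size"),
]

flagged = features_needing_review(features)
print(flagged)
```

A feature may carry several labels at once, which is why `classes` is a set and not a single value.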
GDPR and AI Act Intersection
GDPR Requirements for Training Data
| GDPR Article | Application to AI Training |
|---|---|
| Art. 5(1)(b): Purpose limitation | Training a model is a distinct processing purpose; if data was collected for customer service, using it for ML training requires a compatibility assessment or a new lawful basis |
| Art. 5(1)(c): Data minimisation | Training datasets must not include more personal data than necessary for the model objective |
| Art. 6: Lawful basis | Model training requires its own lawful basis; legitimate interests (Art. 6(1)(f)) is commonly relied on, but requires a documented legitimate interests assessment (LIA) |
| Art. 9: Special categories | If training data contains or enables inference of special category data, an Art. 9(2) condition is required |
| Art. 22: Automated decision-making | If the trained model makes solely automated decisions with legal or similarly significant effects, additional safeguards apply |
| Art. 25: Data protection by design | Classifying training data is a by-design measure that enables appropriate technical protections |
| Art. 35: DPIA | AI processing likely to result in a high risk to individuals (for example profiling or automated decision-making) requires a DPIA |
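Several rows of the table above can be turned into a simple pre-training checklist. The sketch below is one possible encoding under stated assumptions: `DatasetProfile`, its field names, the purpose string `"model_training"`, and `open_gdpr_issues` are hypothetical, and the checks cover only the purpose limitation, lawful basis, special category, and DPIA rows.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class DatasetProfile:
    """Hypothetical summary of what is documented for one training dataset."""
    collected_for: str                   # original collection purpose
    compatibility_assessed: bool         # compatibility assessment on file
    lawful_basis: Optional[str]          # e.g. "legitimate_interests", "consent"
    lia_documented: bool                 # legitimate interests assessment on file
    has_special_category: bool           # contains or enables inference of Art. 9 data
    art9_condition: Optional[str]        # recorded Art. 9(2) condition, if any
    significant_automated_decisions: bool
    dpia_done: bool


def open_gdpr_issues(p: DatasetProfile) -> List[str]:
    """Return the table rows for which documentation appears to be missing."""
    issues = []
    if p.collected_for != "model_training" and not p.compatibility_assessed:
        issues.append("Art. 5(1)(b): no compatibility assessment for reuse in training")
    if p.lawful_basis is None:
        issues.append("Art. 6: no lawful basis recorded for training")
    elif p.lawful_basis == "legitimate_interests" and not p.lia_documented:
        issues.append("Art. 6(1)(f): legitimate interests assessment not documented")
    if p.has_special_category and p.art9_condition is None:
        issues.append("Art. 9: no Art. 9(2) condition recorded")
    if p.significant_automated_decisions and not p.dpia_done:
        issues.append("Art. 35: DPIA missing")
    return issues


# Illustrative profile: customer service data reused for training.
profile = DatasetProfile(
    collected_for="customer_service",
    compatibility_assessed=False,
    lawful_basis="legitimate_interests",
    lia_documented=True,
    has_special_category=True,
    art9_condition=None,
    significant_automated_decisions=False,
    dpia_done=False,
)

issues = open_gdpr_issues(profile)
for issue in issues:
    print(issue)
```

A check like this only reports missing documentation; whether a recorded basis or condition is actually valid remains a legal assessment outside the code.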
Related skills