Data Quality
Core Concepts
1. Data Quality Dimensions for Financial Data
Six dimensions define data quality. Each has domain-specific meaning in financial services.
Accuracy — Data values correctly represent the real-world entity or event they describe. A security price is accurate if it reflects the actual market closing price or evaluated value from the designated source. A client address is accurate if it matches the client's current legal address of record. Accuracy failures propagate: an inaccurate price produces inaccurate valuations, performance, billing, and regulatory reports. Accuracy is measured by comparing data against an independent authoritative source, for example cross-vendor price comparison, reconciliation between the custodian and the portfolio management system (PMS), or client confirmation of personal data. In practice, accuracy is the hardest dimension to measure because every check depends on having such an independent reference point.
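The cross-vendor price comparison mentioned above can be sketched as a tolerance check. This is a minimal illustration, not a prescribed method: the function name `price_breaks`, the security identifiers, the prices, and the 0.5% relative tolerance are all assumptions chosen for the example.

```python
from decimal import Decimal

def price_breaks(primary, secondary, tolerance=Decimal("0.005")):
    """Return {security_id: relative_difference} where the primary price
    deviates from the secondary (reference) price by more than `tolerance`.

    Securities with no usable reference price are skipped: without an
    independent source, accuracy cannot be assessed by this check.
    """
    breaks = {}
    for sec_id, price in primary.items():
        ref = secondary.get(sec_id)
        if ref is None or ref == 0:
            continue
        rel_diff = abs(price - ref) / ref
        if rel_diff > tolerance:
            breaks[sec_id] = rel_diff
    return breaks

# Hypothetical end-of-day prices from two vendors.
primary = {"SEC_A": Decimal("189.50"), "SEC_B": Decimal("410.00"), "SEC_C": Decimal("55.10")}
secondary = {"SEC_A": Decimal("189.52"), "SEC_B": Decimal("401.00")}

breaks = price_breaks(primary, secondary)
print(sorted(breaks))  # SEC_B differs by about 2.2%, SEC_A by about 0.01%
```

A break only says the two sources disagree. Deciding which one is wrong still needs a third source or manual review, and `SEC_C` shows that a missing reference leaves accuracy unverified rather than confirmed.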
Completeness — All required data elements are present for every record. A security master record is incomplete if it lacks an ISIN, asset class classification, or pricing source designation. A client onboarding record is incomplete if beneficial ownership for entity accounts is missing. Completeness is measured as the percentage of records with all mandatory fields populated. Financial data completeness requirements are often regulatory: FinCEN's Customer Due Diligence rule requires beneficial ownership information for legal entity customers, GIPS requires every actual, fee-paying, discretionary portfolio to be included in at least one composite, and SEC Rules 17a-3 and 17a-4 require broker-dealers to make and preserve transaction records. Completeness must be defined per record type: a required field for an entity account (beneficial ownership) differs from a required field for an individual account (employment status).
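As a rough sketch of the measurement described above, the percentage of complete records can be computed against a per-record-type list of mandatory fields. The field names, the `REQUIRED_FIELDS` mapping, and the sample records are hypothetical; a real catalog of mandatory fields would come from the firm's own data standards.

```python
# Hypothetical mandatory fields, defined per record type.
REQUIRED_FIELDS = {
    "entity": ["legal_name", "tax_id", "beneficial_owners"],
    "individual": ["legal_name", "tax_id", "employment_status"],
}

def is_populated(value):
    """Treat None, empty strings, and empty collections as missing."""
    return value not in (None, "", [], {})

def is_complete(record):
    required = REQUIRED_FIELDS[record["record_type"]]
    return all(is_populated(record.get(field)) for field in required)

def completeness_pct(records):
    """Percentage of records with every mandatory field populated."""
    if not records:
        return 100.0
    complete = sum(1 for r in records if is_complete(r))
    return 100.0 * complete / len(records)

records = [
    {"record_type": "entity", "legal_name": "Acme Holdings LLC", "tax_id": "T-1",
     "beneficial_owners": ["Owner A"]},
    {"record_type": "entity", "legal_name": "Birch Trust", "tax_id": "T-2",
     "beneficial_owners": []},  # incomplete: no beneficial owners recorded
    {"record_type": "individual", "legal_name": "Jane Doe", "tax_id": "T-3",
     "employment_status": "employed"},
    {"record_type": "individual", "legal_name": "John Roe", "tax_id": "T-4",
     "employment_status": "retired"},
]

print(completeness_pct(records))
```

Keeping the required-field list keyed by record type means an individual account is never penalized for lacking beneficial ownership data, and an entity account is never passed without it.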
Timeliness — Data is available when needed for its intended use. End-of-day pricing must arrive before the nightly valuation batch runs. Trade confirmations must be sent at or before completion of the transaction, as SEC Rule 10b-10 requires. NAV calculations must complete before fund company deadlines. Timeliness is measured as the lag between event occurrence and data availability in consuming systems. Late data is functionally equivalent to missing data if it arrives after the processing window closes. Timeliness requirements vary dramatically by use case: real-time market data must arrive in milliseconds, EOD pricing within hours, and quarterly regulatory filings within weeks.
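The lag measurement and the processing-window test described above might look like the following sketch. The timestamps, the function names, and the 23:00 UTC batch cutoff are illustrative assumptions only.

```python
from datetime import datetime, timedelta, timezone

def availability_lag(event_time: datetime, available_time: datetime) -> timedelta:
    """Lag between the event occurring and the data being usable downstream."""
    return available_time - event_time

def missed_window(available_time: datetime, cutoff: datetime) -> bool:
    """Data arriving after the cutoff is effectively missing for that run."""
    return available_time > cutoff

# Hypothetical end-of-day pricing feed, all timestamps in UTC.
market_close = datetime(2024, 3, 1, 21, 0, tzinfo=timezone.utc)
price_loaded = datetime(2024, 3, 1, 23, 30, tzinfo=timezone.utc)
batch_cutoff = datetime(2024, 3, 1, 23, 0, tzinfo=timezone.utc)

lag = availability_lag(market_close, price_loaded)
late = missed_window(price_loaded, batch_cutoff)
print(lag, late)
```

Using timezone-aware timestamps throughout avoids false lags when the source system and the consuming system run in different time zones.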
Consistency — The same fact is represented identically across all systems and time periods. A client's legal name must match across the CRM, custodian, PMS, and billing system. A security's sector classification must be the same in the portfolio management system and the compliance monitoring system. Inconsistency typically indicates either a missing golden source designation or a broken synchronization process. Consistency is measured by cross-system comparison for the same entity attribute. Temporal consistency also matters: a security's classification should not change retroactively without documented justification and downstream impact assessment.
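A cross-system comparison for a single attribute can be sketched as below. The system names, entity identifiers, and the `find_inconsistencies` helper are hypothetical, and the comparison is a strict string match with no normalization, which is an assumption made to keep the example small.

```python
def find_inconsistencies(systems, attribute):
    """Compare one attribute for the same entity across systems.

    `systems` maps system name -> {entity_id: record dict}.
    Returns {entity_id: {system: value}} for entities whose values differ.
    Entities absent from a system are not flagged here; that is a
    completeness question rather than a consistency one.
    """
    by_entity = {}
    for system, records in systems.items():
        for entity_id, record in records.items():
            by_entity.setdefault(entity_id, {})[system] = record.get(attribute)
    return {
        entity_id: values
        for entity_id, values in by_entity.items()
        if len(set(values.values())) > 1
    }

systems = {
    "crm": {"C001": {"legal_name": "Acme Holdings LLC"},
            "C002": {"legal_name": "Jane Doe"}},
    "custodian": {"C001": {"legal_name": "Acme Holdings, LLC"},
                  "C002": {"legal_name": "Jane Doe"}},
    "billing": {"C001": {"legal_name": "Acme Holdings LLC"}},
}

mismatches = find_inconsistencies(systems, "legal_name")
print(mismatches)
```

Here `C001` is flagged because the custodian record carries a comma that the CRM and billing records lack. Whether that difference matters is a business decision; once a golden source is designated, the check can compare every system against it instead of pairwise.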
Validity — Data conforms to defined formats, ranges, and business rules. A CUSIP must be exactly 9 characters with a valid check digit. An account registration type must be one of the firm's defined values. A bond coupon rate cannot be negative (for conventional bonds). A trade settlement date cannot precede the trade date. Validity is enforced through schema constraints, field-level validation, and business rule engines. Invalid data that passes into production indicates insufficient input validation. Validity rules should be versioned and maintained as a formal catalog — when rules change, the change should be documented with effective date and rationale.
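Two of the rules above, the CUSIP check digit and the settlement date ordering, can be expressed as field-level validators. The sketch below uses the standard CUSIP check-digit scheme (letters map to 10 through 35, the special characters `*`, `@`, `#` to 36 through 38, every second character is doubled, and the digits of each result are summed). The function names are assumptions for illustration.

```python
import string
from datetime import date

# Character values: '0'-'9' -> 0-9, 'A'-'Z' -> 10-35, then '*', '@', '#'.
_CUSIP_VALUES = {c: i for i, c in enumerate(string.digits + string.ascii_uppercase)}
_CUSIP_VALUES.update({"*": 36, "@": 37, "#": 38})

def cusip_check_digit(base8: str) -> int:
    """Compute the check digit for the first 8 characters of a CUSIP."""
    total = 0
    for i, ch in enumerate(base8):
        value = _CUSIP_VALUES[ch]
        if i % 2 == 1:          # 2nd, 4th, 6th, 8th characters are doubled
            value *= 2
        total += value // 10 + value % 10
    return (10 - total % 10) % 10

def is_valid_cusip(cusip: str) -> bool:
    """Exactly 9 characters, allowed alphabet, and a matching check digit."""
    cusip = cusip.upper()
    if len(cusip) != 9 or cusip[8] not in string.digits:
        return False
    if any(ch not in _CUSIP_VALUES for ch in cusip[:8]):
        return False
    return cusip_check_digit(cusip[:8]) == int(cusip[8])

def settles_on_or_after_trade(trade_date: date, settlement_date: date) -> bool:
    """A settlement date cannot precede the trade date."""
    return settlement_date >= trade_date

print(is_valid_cusip("037833100"))   # sample CUSIP whose check digit computes to 0
print(is_valid_cusip("037833101"))   # same base with a wrong check digit
print(settles_on_or_after_trade(date(2024, 3, 4), date(2024, 3, 1)))
```

Validators like these are easiest to version when each rule is a small named function, so a rule change can be recorded with its effective date alongside the code change.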
Uniqueness — Each real-world entity is represented exactly once. A client who appears as two records in the CRM, duplicated through a name variation or data entry error, causes fragmented reporting, missed household billing discounts, and potential compliance failures (wash sale detection across accounts requires a unified client view). A security set up as a separate master record for each custodian causes duplicated positions. Uniqueness is enforced through deduplication at ingestion and periodic duplicate detection scans. Common deduplication techniques include exact-match on identifiers (SSN, CUSIP), fuzzy matching on names and addresses (Jaro-Winkler, Levenshtein distance), and probabilistic matching combining multiple weak identifiers into a confidence score.
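A duplicate detection scan combining exact identifier matching with Levenshtein-based fuzzy name matching could be sketched as follows. The record layout, the `tax_id` placeholder values, the name normalization, and the 0.85 similarity threshold are assumptions for illustration. Jaro-Winkler and full probabilistic scoring are not shown.

```python
from itertools import combinations

def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def normalize(name: str) -> str:
    """Lowercase, drop commas and periods, collapse whitespace."""
    return " ".join(name.lower().replace(",", " ").replace(".", " ").split())

def name_similarity(a: str, b: str) -> float:
    """1.0 for identical normalized names, approaching 0.0 as they diverge."""
    a, b = normalize(a), normalize(b)
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - levenshtein(a, b) / longest

def candidate_duplicates(records, threshold=0.85):
    """Pairs of record ids that share a tax_id or have near-identical names."""
    pairs = []
    for r1, r2 in combinations(records, 2):
        same_id = r1["tax_id"] is not None and r1["tax_id"] == r2["tax_id"]
        if same_id or name_similarity(r1["name"], r2["name"]) >= threshold:
            pairs.append((r1["id"], r2["id"]))
    return pairs

# Hypothetical CRM records.
records = [
    {"id": 1, "name": "Jonathan Smith", "tax_id": "TIN-001"},
    {"id": 2, "name": "Jonathon Smith", "tax_id": None},
    {"id": 3, "name": "Maria Garcia", "tax_id": "TIN-002"},
]

print(candidate_duplicates(records))
```

The pairwise scan is quadratic in the number of records, so a production scan would normally restrict comparisons to blocks of records that share something cheap to compute, such as a postal code or name prefix. Matches found this way are candidates for review, not confirmed duplicates.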