# Data Quality Audit Checklist: Ensure Reliable AI/ML Inputs
High-quality data is the foundation of reliable AI and ML systems. This checklist guides you through defining goals, running targeted checks, and fixing common issues so your models train on trustworthy inputs.
- Define clear objectives and scope before touching datasets.
- Run automated inventory, profiling, and validity tests early.
- Focus on completeness, consistency, timeliness, and duplication.
- Prioritize fixes that reduce downstream model risk and bias.
## Define objectives and scope
Start by documenting why you’re auditing the data and what success looks like. Tie the audit to specific model use-cases, KPIs, and regulatory requirements (e.g., fairness, explainability, audit trails).
- Primary objective: e.g., “Reduce label noise for fraud model to <5%.”
- Scope: datasets, time range, feature sets, derived fields, and environments (dev/staging/prod).
- Stakeholders: model owners, data engineers, compliance, and product managers.
- Acceptance criteria: thresholds for completeness, accuracy, duplication, and latency.
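Acceptance criteria like those above can be captured as machine-checkable thresholds. This is a minimal Python sketch; the dataset name, metric names, and limit values are all illustrative assumptions, not prescriptions:

```python
# Illustrative acceptance criteria for one dataset.
# Each limit is the maximum tolerable value for the measured metric.
ACCEPTANCE_CRITERIA = {
    "transactions": {
        "null_rate": 0.01,       # at most 1% nulls in required columns
        "duplicate_rate": 0.001, # at most 0.1% duplicate rows
        "latency_hours": 24,     # daily batch must land within a day
    },
}

def meets_criteria(dataset: str, metrics: dict) -> bool:
    """Return True when every measured metric is within its threshold."""
    limits = ACCEPTANCE_CRITERIA[dataset]
    return all(metrics[name] <= limit for name, limit in limits.items())
```

Wiring a check like this into CI makes the acceptance criteria executable rather than aspirational.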
## Quick answer
Perform an inventory, run profiling and validation checks, fix issues prioritized by model impact, and implement monitoring to prevent regressions.
## Inventory and profiling checks
Build a complete catalog of datasets, tables, and feature stores. Profile each dataset to capture schema, data types, cardinality, distribution, and basic statistics.
- Record metadata: owner, refresh cadence, source system, downstream consumers.
- Schema snapshot: column names, types, nullability, and constraints.
- Profiling metrics: min/max, mean, median, std, percentiles, unique counts.
- Distribution checks: histograms for numeric features and frequency tables for categoricals.
| Metric | Why it matters |
|---|---|
| Null rate | Signals missingness and downstream imputation needs |
| Unique count / cardinality | Affects encoding and memory use |
| Value range | Detects anomalies or data type issues |
| Top categories | Shows skew and potential bias |
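A minimal profiling pass over a single column can be sketched in plain Python (at scale you would use pandas or a dedicated profiler); the function name and the set of metrics it emits are illustrative:

```python
from collections import Counter
from statistics import mean, median

def profile_column(values):
    """Compute basic profiling metrics for one column (None = missing)."""
    present = [v for v in values if v is not None]
    profile = {
        "null_rate": 1 - len(present) / len(values),
        "unique_count": len(set(present)),
    }
    if present and all(isinstance(v, (int, float)) for v in present):
        # Numeric column: capture summary statistics.
        profile.update(min=min(present), max=max(present),
                       mean=mean(present), median=median(present))
    else:
        # Categorical column: capture the most frequent values (skew/bias signal).
        profile["top_categories"] = Counter(present).most_common(3)
    return profile
```

Running this per column and storing the output gives the schema/profiling snapshot the checklist calls for.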
## Accuracy and validity tests
Verify that values conform to expected formats, ranges, and reference data. Accuracy tests compare dataset values to authoritative sources or known business rules.
- Format checks: regex for emails, phone numbers, timestamps.
- Range checks: numeric features must fall within logical bounds (e.g., age 0–120).
- Referential integrity: foreign keys link to valid master tables.
- Cross-field validation: birthdate <= record date; loan_amount > 0 when status = approved.
Example rule (concise): `IF status='shipped' THEN shipped_date IS NOT NULL`.
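The format, range, and cross-field rules above can be combined into a single record validator. This sketch hardcodes a few illustrative rules; the email regex and field names are assumptions:

```python
import re

# Deliberately simple email pattern for illustration; real-world email
# validation is considerably more involved.
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

def validate_record(rec):
    """Return a list of rule violations for one record (empty = valid)."""
    errors = []
    if not EMAIL_RE.match(rec["email"]):
        errors.append("email: bad format")
    if not (0 <= rec["age"] <= 120):
        errors.append("age: out of range")
    # Cross-field rule from the text: shipped orders need a shipped_date.
    if rec["status"] == "shipped" and rec.get("shipped_date") is None:
        errors.append("shipped_date: required when status='shipped'")
    return errors
```

Returning all violations (rather than failing on the first) makes audit reports more actionable.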
## Completeness and missingness checks
Assess which fields have missing values and whether the missingness is random or correlated with outcomes (correlated missingness can bias models).
- Field-level completeness: percent non-null per column.
- Row-level completeness: percent of rows meeting minimal required fields.
- Block/mask checks: identify systematic missingness by source, partition, or time window.
- Imputation strategy mapping: per-field preferred handling (drop, mean/median, model-based, sentinel).
| Missingness pattern | Suggested action |
|---|---|
| Random small fraction (<1%) | Simple imputation or ignore |
| Systematic by source/time | Root-cause fix and source remediation |
| Correlated with target | Model-aware imputation or feature engineering |
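Field-level and row-level completeness can be measured with a short helper; `completeness_report` is an illustrative sketch for records stored as dicts:

```python
def completeness_report(rows, required_fields):
    """Field-level and row-level completeness for a list of dict records."""
    n = len(rows)
    # Percent non-null per required field.
    per_field = {
        f: sum(r.get(f) is not None for r in rows) / n for f in required_fields
    }
    # Percent of rows where every required field is present.
    full_rows = sum(
        all(r.get(f) is not None for f in required_fields) for r in rows
    )
    return {"per_field": per_field, "row_level": full_rows / n}
```

Slicing the same report by source system or time window surfaces the systematic missingness patterns the table above describes.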
## Consistency and duplication detection
Check for conflicting records, duplicate entries, and inconsistent representations (e.g., multiple codes for the same category).
- Deduplication keys: choose stable identifiers or composite keys to detect duplicates.
- Versioning vs. append: determine whether duplicates are updates or true duplicates.
- Canonicalization: normalize units, casing, and category mappings (e.g., “US” vs “United States”).
- Conflict resolution policy: keep latest, authoritative source, or merge attributes.
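A "keep latest" conflict-resolution policy plus simple canonicalization might look like this sketch (the country alias map, key fields, and timestamp field are assumptions):

```python
# Illustrative alias map: many representations, one canonical code.
CANONICAL_COUNTRY = {"us": "US", "united states": "US", "u.s.": "US"}

def canonicalize(value):
    """Normalize casing/whitespace and map known aliases to one code."""
    return CANONICAL_COUNTRY.get(value.strip().lower(), value.strip().upper())

def dedupe_keep_latest(records, key_fields, ts_field):
    """Keep the most recent record per composite key ('keep latest' policy)."""
    latest = {}
    for rec in records:
        key = tuple(rec[f] for f in key_fields)
        if key not in latest or rec[ts_field] > latest[key][ts_field]:
            latest[key] = rec
    return list(latest.values())
```

Whether "latest wins" is correct depends on your conflict-resolution policy; an authoritative-source or attribute-merge policy would replace the comparison inside the loop.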
## Timeliness and freshness verification
Ensure data freshness meets model and operational needs. Monitor latency from event occurrence to dataset availability.
- Define freshness SLAs per dataset (e.g., near-real-time, daily batch).
- Measure end-to-end latency and percentiles (P50, P95, P99).
- Backfill and replayability: confirm ability to rebuild historic feature values.
- Alerting: trigger when latency or data gaps exceed thresholds.
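The latency percentiles and alert threshold above can be computed with a nearest-rank sketch (the function name and the choice to alert on P95 are illustrative; production systems would use their metrics/observability stack):

```python
def latency_percentiles(latencies_sec, sla_sec):
    """P50/P95/P99 latency via nearest-rank, plus an SLA breach flag."""
    ordered = sorted(latencies_sec)

    def pct(p):
        # Nearest-rank percentile: smallest value covering p% of samples.
        idx = max(0, -(-p * len(ordered) // 100) - 1)  # ceil division
        return ordered[idx]

    stats = {"p50": pct(50), "p95": pct(95), "p99": pct(99)}
    stats["sla_breached"] = stats["p95"] > sla_sec
    return stats
```

Measuring latency end-to-end (event time to dataset availability) rather than per hop is what makes the SLA meaningful for model consumers.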
## Common pitfalls and how to avoid them
- Relying on sample-only checks — run full-table scans or representative partitions to catch rare issues.
- Ignoring source-side bugs — instrument ingestion pipelines and add end-to-end checks.
- Treating duplicates as updates without policy — define canonicalization and retention rules.
- Fixing symptoms, not causes — prioritize source fixes over downstream patching when feasible.
- Not monitoring post-deployment drift — implement production data and performance monitors.
## Implementation checklist
- Document objectives, scope, stakeholders, and acceptance criteria.
- Create dataset inventory and capture schema/profiling snapshots.
- Implement automated validation rules (format, range, referential) in CI or pipelines.
- Run completeness, duplication, and consistency remediation; log changes and root causes.
- Set freshness SLAs and build latency monitoring with alerts.
- Schedule periodic re-profiling and bias/accuracy rechecks after model training.
## FAQ
- How often should I run a data quality audit?
  - Run lightweight automated checks continuously and schedule full audits monthly or before major model retraining.
- What tools help automate these checks?
  - Use data profiling and validation tools (open-source or commercial) integrated with CI and observability stacks.
- How do I prioritize fixes?
  - Prioritize by model impact: address issues that affect key features, target leakage, or bias first.
- When should I fix at source vs downstream?
  - Prefer source fixes for systemic issues; use downstream patches for transient or one-off cases while source remediation is planned.
- How do I prove data quality improvements?
  - Track metrics (null rates, error rates, latency, duplicate counts) over time and link changes to model performance improvements.
