# Data Quality Testing: Ensuring Trust in Your Data

*A practical guide to data quality testing with checks for completeness, validity, and freshness, plus examples with Great Expectations to prevent bad data from reaching production.*

Data quality issues can have severe business impact. This guide shows how to implement robust data quality testing in your pipelines.
## Why Data Quality Matters
Poor data quality costs organizations millions annually:
- Bad decisions based on incorrect data
- Customer trust erosion
- Regulatory compliance failures
## Data Quality Dimensions

### The Six Pillars
| Dimension | Description | Example Check |
|---|---|---|
| Completeness | No missing values | `df.isnull().sum()` |
| Accuracy | Correct values | Range validation |
| Consistency | Uniform format | Schema checks |
| Timeliness | Up-to-date | Freshness checks |
| Uniqueness | No duplicates | Key validation |
| Validity | Conforms to rules | Business rules |
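As a quick illustration, several of these dimensions can be checked directly in pandas. The `orders` frame below is a made-up example with deliberate quality problems:

```python
import pandas as pd

# Hypothetical sample data containing deliberate quality issues
orders = pd.DataFrame({
    "order_id": [1, 2, 2, 4],             # duplicate key
    "amount": [19.99, None, 5.00, -3.0],  # missing and negative values
})

missing = orders["amount"].isnull().sum()           # completeness
out_of_range = (orders["amount"] < 0).sum()         # accuracy (range check)
duplicates = orders["order_id"].duplicated().sum()  # uniqueness

print(missing, out_of_range, duplicates)  # 1 1 1
```

Each check returns a count of violations, which makes it easy to turn into a pass/fail gate (violations must equal zero) later in the pipeline.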
## Implementing Data Tests

### Using Great Expectations
```python
import great_expectations as ge

# Wrap an existing pandas DataFrame (df) in a Great Expectations dataset
df_ge = ge.from_pandas(df)

# Completeness test: no missing customer IDs
df_ge.expect_column_values_to_not_be_null('customer_id')

# Accuracy test: amounts within a plausible range
df_ge.expect_column_values_to_be_between(
    'amount',
    min_value=0,
    max_value=1_000_000,
)

# Uniqueness test: transaction IDs must be distinct
df_ge.expect_column_values_to_be_unique('transaction_id')
```
### Custom Validators

```python
def validate_email(df, column):
    """Raise ValueError if any value in `column` is not a valid email."""
    pattern = r'^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+$'
    # na=False treats missing values as non-matching instead of NaN,
    # so the negation below stays a clean boolean mask
    invalid = ~df[column].str.match(pattern, na=False)
    if invalid.any():
        raise ValueError(f"Invalid emails found: {invalid.sum()}")
```
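A quick usage sketch (the sample frame is hypothetical; the validator is repeated here so the snippet runs on its own):

```python
import pandas as pd

def validate_email(df, column):
    """Raise ValueError if any value in `column` is not a valid email."""
    pattern = r'^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+$'
    invalid = ~df[column].str.match(pattern, na=False)
    if invalid.any():
        raise ValueError(f"Invalid emails found: {invalid.sum()}")

# Two bad rows: a malformed address and a missing value
users = pd.DataFrame({"email": ["a@example.com", "not-an-email", None]})

caught = None
try:
    validate_email(users, "email")
except ValueError as exc:
    caught = str(exc)

print(caught)  # Invalid emails found: 2
```

Raising instead of returning a boolean makes the validator usable as a hard gate: a pipeline step that calls it simply fails rather than passing bad rows downstream.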
## Integrating with Pipelines

Add quality checks to your pipeline (see *Building Data Pipelines with Python: A Complete Guide*):
```python
class QualityAwarePipeline(DataPipeline):
    def run(self):
        df = self.extract()
        self.validate(df)  # Quality gate: fail fast on bad input
        df = self.transform(df)
        self.validate(df)  # Post-transform check
        self.load(df)
```
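`DataPipeline` and `validate()` here come from the companion guide. A minimal self-contained sketch of the same pattern, with made-up stage implementations and plain-Python rows, might look like:

```python
class MiniPipeline:
    """Toy pipeline: validate between stages (illustrative only)."""

    def extract(self):
        return [{"id": 1, "amount": 10.0}, {"id": 2, "amount": 25.5}]

    def validate(self, rows):
        # Completeness and accuracy gates: raise on the first violation
        for row in rows:
            if row["id"] is None:
                raise ValueError("missing id")
            if row["amount"] < 0:
                raise ValueError("negative amount")

    def transform(self, rows):
        return [{**row, "amount_cents": int(row["amount"] * 100)} for row in rows]

    def load(self, rows):
        self.loaded = rows  # stand-in for a real write

    def run(self):
        rows = self.extract()
        self.validate(rows)   # quality gate before transforming
        rows = self.transform(rows)
        self.validate(rows)   # post-transform check
        self.load(rows)

pipeline = MiniPipeline()
pipeline.run()
print(len(pipeline.loaded))  # 2
```

Validating both before and after the transform matters: the first gate catches bad source data, while the second catches bugs in your own transformation logic.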
## Monitoring and Alerting
Set up continuous monitoring. See *Monitoring Data Pipelines: Observability Best Practices* for detailed implementation.
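As a starting point, the timeliness dimension from the table above can be monitored with a simple freshness check: compare the newest record's timestamp against an age threshold. The one-hour threshold below is an illustrative default, not a recommendation:

```python
from datetime import datetime, timedelta, timezone

def check_freshness(latest_timestamp, max_age=timedelta(hours=1)):
    """Return True if the newest record is within the allowed age."""
    age = datetime.now(timezone.utc) - latest_timestamp
    return age <= max_age

# Simulated "last record seen" timestamps
recent = datetime.now(timezone.utc) - timedelta(minutes=5)
stale = datetime.now(timezone.utc) - timedelta(hours=3)

print(check_freshness(recent), check_freshness(stale))  # True False
```

Wire the boolean result into your alerting system so a stale source pages someone instead of silently serving yesterday's data.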
## Conclusion

Data quality is not optional. Build quality checks into every pipeline from day one: quality data drives quality decisions.