Data Quality Testing: Ensuring Trust in Your Data

A practical guide to data quality testing with checks for completeness, validity, and freshness—plus examples with Great Expectations to prevent bad data from reaching production.

Data quality issues can have severe business impacts. This guide shows you how to implement robust data quality testing in your pipelines.

Why Data Quality Matters

Poor data quality is expensive, and the cost usually shows up in three ways:

  • Bad decisions based on incorrect data
  • Customer trust erosion
  • Regulatory compliance failures

Data Quality Dimensions

The Six Pillars

| Dimension    | Description       | Example Check       |
|--------------|-------------------|---------------------|
| Completeness | No missing values | df.isnull().sum()   |
| Accuracy     | Correct values    | Range validation    |
| Consistency  | Uniform format    | Schema checks       |
| Timeliness   | Up-to-date        | Freshness checks    |
| Uniqueness   | No duplicates     | Key validation      |
| Validity     | Conforms to rules | Business rules      |
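
As a rough illustration of how a few of these dimensions translate into code, the sketch below computes one simple metric each for completeness, uniqueness, and timeliness using plain pandas. The helper name `quality_report`, its parameters, and the `orders` sample data are all hypothetical and exist only for this example.

```python
import pandas as pd


def quality_report(df, key, ts_col, max_age, now):
    """Return one simple metric per dimension (hypothetical helper)."""
    return {
        # Completeness: number of missing values in each column
        "missing_per_column": df.isnull().sum().to_dict(),
        # Uniqueness: number of rows whose key repeats an earlier row
        "duplicate_keys": int(df[key].duplicated().sum()),
        # Timeliness: is the newest record no older than max_age?
        "is_fresh": bool(now - df[ts_col].max() <= max_age),
    }


# Made-up sample data with one missing amount and one repeated order_id
orders = pd.DataFrame({
    "order_id": [1, 2, 2],
    "amount": [10.0, None, 5.0],
    "updated_at": pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-03"]),
})

report = quality_report(
    orders,
    key="order_id",
    ts_col="updated_at",
    max_age=pd.Timedelta(days=1),
    now=pd.Timestamp("2024-01-03 12:00"),
)
print(report)
```

Passing `now` explicitly, rather than reading the clock inside the function, keeps the freshness check deterministic and easy to test.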

Implementing Data Tests

Using Great Expectations

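import great_expectations as ge

# df is assumed to be an existing pandas DataFrame with customer_id,
# amount, and transaction_id columns.
# ge.from_pandas belongs to the legacy Great Expectations API; newer
# releases use a different, context-based workflow.
df_ge = ge.from_pandas(df)

# Completeness test: customer_id must never be null
df_ge.expect_column_values_to_not_be_null('customer_id')

# Accuracy test: amount must fall within a plausible range
df_ge.expect_column_values_to_be_between(
    'amount',
    min_value=0,
    max_value=1000000
)

# Uniqueness test: each transaction_id may appear only once
df_ge.expect_column_values_to_be_unique('transaction_id')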

Custom Validators

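def validate_email(df, column):
    """Raise ValueError if any value in df[column] is not a plausible email."""
    # Simple structural check, not full RFC 5322 validation
    pattern = r'^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9.-]+$'
    # na=False counts missing values as invalid; without it, NaN results
    # make the ~ negation fail on columns that contain nulls
    invalid = ~df[column].str.match(pattern, na=False)
    if invalid.any():
        raise ValueError(f"Invalid emails found: {invalid.sum()}")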
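
When debugging, it often helps to see which rows failed rather than only a count. The sketch below is one possible variant that returns the offending rows instead of raising. The function name `find_invalid_emails`, the `EMAIL_PATTERN` constant, and the sample data are hypothetical.

```python
import pandas as pd

# Simple structural pattern, not full RFC 5322 validation
EMAIL_PATTERN = r'^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9.-]+$'


def find_invalid_emails(df, column):
    """Return the rows whose value in `column` is missing or malformed."""
    # na=False makes missing values count as non-matching
    valid = df[column].str.match(EMAIL_PATTERN, na=False)
    return df[~valid]


# Made-up sample data: one valid address, one malformed, one missing
users = pd.DataFrame({"email": ["a@example.com", "not-an-email", None]})
bad_rows = find_invalid_emails(users, "email")
print(bad_rows)
```

Returning a DataFrame lets the caller decide whether to raise, log, or quarantine the bad rows.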

Integrating with Pipelines

Add quality checks to the pipeline from Building Data Pipelines with Python: A Complete Guide:

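# DataPipeline is assumed to be the base class from the pipeline guide.
# validate() is assumed to raise an exception when a check fails, which stops the run.
class QualityAwarePipeline(DataPipeline):
    def run(self):
        df = self.extract()
        self.validate(df)  # Quality gate: reject bad input before transforming
        df = self.transform(df)
        self.validate(df)  # Post-transform check: catch errors the transform introduced
        self.load(df)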
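
To make the quality gate concrete, here is a minimal self-contained sketch. The `DataPipeline` class below is a hypothetical stand-in for the base class in the pipeline guide, and the `validate` method, the `required_columns` attribute, and the sample data are assumptions made for illustration.

```python
import pandas as pd


class DataPipeline:
    """Hypothetical stand-in for the base class from the pipeline guide."""

    def extract(self):
        # Made-up input with one missing customer_id
        return pd.DataFrame({
            "customer_id": [1, 2, None],
            "amount": [10.0, 20.0, 30.0],
        })

    def transform(self, df):
        return df.assign(amount_cents=(df["amount"] * 100).astype(int))

    def load(self, df):
        # A real pipeline would write to a warehouse; here we just keep it
        self.loaded = df


class QualityAwarePipeline(DataPipeline):
    required_columns = ["customer_id"]

    def validate(self, df):
        """Raise ValueError if any required column contains nulls."""
        for column in self.required_columns:
            missing = int(df[column].isnull().sum())
            if missing:
                raise ValueError(f"{column}: {missing} missing value(s)")

    def run(self):
        df = self.extract()
        self.validate(df)  # Quality gate
        df = self.transform(df)
        self.validate(df)  # Post-transform check
        self.load(df)


pipeline = QualityAwarePipeline()
try:
    pipeline.run()
    outcome = "loaded"
except ValueError as exc:
    outcome = f"blocked: {exc}"
print(outcome)
```

Because `validate` raises, the run stops at the first gate and `load` is never called, so the bad batch does not reach the destination.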

Monitoring and Alerting

Set up continuous monitoring. See Monitoring Data Pipelines: Observability Best Practices for detailed implementation.
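
As a starting point, monitoring can be as simple as comparing a few quality metrics against thresholds after each run and logging a warning when one is breached. The sketch below uses only the standard library. The `Threshold` class, the `check_metrics` function, and the metric names and limits are hypothetical and would need to match whatever your pipeline actually measures.

```python
import logging
from dataclasses import dataclass

logger = logging.getLogger("data_quality")


@dataclass
class Threshold:
    metric: str       # name of the metric to check
    max_value: float  # highest acceptable value


def check_metrics(metrics, thresholds):
    """Return an alert message for each missing or out-of-range metric."""
    alerts = []
    for threshold in thresholds:
        value = metrics.get(threshold.metric)
        if value is None:
            # A metric that stops being reported is itself worth an alert
            alerts.append(f"{threshold.metric}: metric missing")
        elif value > threshold.max_value:
            alerts.append(
                f"{threshold.metric}: {value} exceeds {threshold.max_value}"
            )
    for message in alerts:
        logger.warning(message)
    return alerts


# Made-up metrics from a single pipeline run
alerts = check_metrics(
    {"null_rate": 0.12, "hours_since_update": 3},
    [
        Threshold("null_rate", 0.05),
        Threshold("hours_since_update", 24),
        Threshold("duplicate_rate", 0.0),
    ],
)
print(alerts)
```

In practice the returned messages would be routed to whatever alerting channel the team already uses.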

Conclusion

Data quality is not optional. Build quality checks into every pipeline from day one.


Quality data drives quality decisions.