Complete Great Expectations Tutorial: Data Quality Testing for ML Pipelines
Great Expectations is an open-source Python library for validating, documenting, and profiling your data. It helps you maintain data quality by defining "expectations" - assertions about your data that can be automatically tested.
Why Great Expectations?
Great Expectations Advantages:
- Data validation: Test data quality automatically
- Documentation: Auto-generated data docs
- Profiling: Automatic expectation generation
- Integration: Works with pandas, Spark, SQL
- CI/CD ready: Fits into data pipelines
Common use cases:
- Data pipeline testing
- ML data validation
- Data migration verification
- ETL quality assurance
- Data contract enforcement
Installation
# Basic installation
pip install great_expectations
# With specific backends
pip install "great_expectations[spark]"
pip install "great_expectations[sqlalchemy]"
# Verify installation
great_expectations --version
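The installation can also be checked from Python without importing the library itself. The following is a minimal sketch using only the standard library; the helper name `installed_version` is hypothetical, and it assumes the package is published under the distribution name `great_expectations`.

```python
from importlib import metadata
from typing import Optional


def installed_version(distribution: str = "great_expectations") -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent.

    `installed_version` is an illustrative helper, not part of Great Expectations.
    """
    try:
        return metadata.version(distribution)
    except metadata.PackageNotFoundError:
        return None


version = installed_version()
print(version if version else "great_expectations is not installed")
```

Returning `None` instead of raising makes the helper easy to use as a guard at the top of a pipeline script or in a CI step.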
Quick Start
1. Initialize Project
# Create a new GX project
great_expectations init
This creates:
great_expectations/
├── checkpoints/
├── expectations/
├── plugins/
├── profilers/
├── uncommitted/
└── great_expectations.yml
2. Basic Usage with Pandas
import great_expectations as gx
import pandas as pd
# Create context
context = gx.get_context()
# Sample data
df = pd.DataFrame({
    'id': [1, 2, 3, 4, 5],
    'name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'age': [25, 30, 35, 40, 45],
    'email': ['alice@example.com', 'bob@example.com', 'charlie@example.com',
              'david@example.com', 'eve@example.com'],
    'salary': [50000, 60000, 70000, 80000, 90000]
})
# Create expectation suite
suite = context.add_expectation_suite("my_suite")
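# Get validator
# Assumes the fluent pandas datasource API of GX 0.17/0.18;
# the datasource and asset names are illustrative.
datasource = context.sources.add_pandas("my_pandas_datasource")
asset = datasource.add_dataframe_asset(name="my_dataframe_asset")
validator = context.get_validator(
    batch_request=asset.build_batch_request(dataframe=df),
    expectation_suite_name="my_suite"
)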
# Add expectations
validator.expect_column_to_exist("id")
validator.expect_column_values_to_be_unique("id")
validator.expect_column_values_to_not_be_null("name")
validator.expect_column_values_to_be_between("age", min_value=18, max_value=100)
validator.expect_column_values_to_match_regex("email", r"^[\w\.-]+@[\w\.-]+\.\w+$")
# Save expectations
validator.save_expectation_suite(discard_failed_expectations=False)
# Validate
results = validator.validate()
print(f"Success: {results.success}")
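Before adding a regex expectation to a suite, it can help to sanity-check the pattern on its own. The sketch below uses only Python's `re` module and the email pattern from the example above; `failing_values` is a hypothetical helper, and the two malformed addresses are made-up test inputs, not data from the tutorial.

```python
import re

# Same pattern as in the email expectation above.
EMAIL_PATTERN = re.compile(r"^[\w\.-]+@[\w\.-]+\.\w+$")


def failing_values(values, pattern=EMAIL_PATTERN):
    """Return the values that do not match the pattern (illustrative helper)."""
    return [value for value in values if not pattern.search(value)]


emails = [
    "alice@example.com", "bob@example.com", "charlie@example.com",
    "david@example.com", "eve@example.com",
]

# All sample addresses match, so nothing is reported.
print(failing_values(emails))  # → []

# Hypothetical bad inputs: no "@" at all, and no dot after the domain.
print(failing_values(["not-an-email", "missing@tld"]))
# → ['not-an-email', 'missing@tld']
```

This only approximates what the library does internally, but it is a quick way to catch an over-strict or over-permissive pattern before it lands in a suite.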
Expectations
1. Column Existence
# Column exists
validator.expect_column_to_exist("column_name")
# Columns in set
validator.expect_table_columns_to_match_set(
    column_set=["id", "name", "email", "created_at"],
    exact_match=False  # Allow additional columns
)
# Column order
validator.expect_table_columns_to_match_ordered_list(
    column_list=["id", "name", "email", "created_at"]
)
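The difference between exact and non-exact set matching can be sketched in plain Python. This is an approximation of the semantics under the assumption that non-exact matching only requires the listed columns to be present; `columns_match_set` is a hypothetical helper, not a library function, and the `updated_at` column is an invented example.

```python
def columns_match_set(actual_columns, column_set, exact_match=True):
    """Compare a table's columns with an expected set (illustrative helper).

    Returns (success, missing, unexpected). With exact_match=False, extra
    columns are tolerated and only missing columns cause a failure.
    """
    actual, expected = set(actual_columns), set(column_set)
    missing = sorted(expected - actual)
    unexpected = sorted(actual - expected)
    if exact_match:
        success = not missing and not unexpected
    else:
        success = not missing
    return success, missing, unexpected


expected = ["id", "name", "email", "created_at"]
table_columns = ["id", "name", "email", "created_at", "updated_at"]

print(columns_match_set(table_columns, expected, exact_match=False))
# → (True, [], ['updated_at'])
print(columns_match_set(table_columns, expected, exact_match=True))
# → (False, [], ['updated_at'])
```

Non-exact matching suits tables that grow new columns over time, while exact matching is the stricter choice for enforcing a fixed data contract.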
2. Null and Unique Values
# Not null
validator.expect_column_values_to_not_be_null("required_field")
# Allow some nulls
validator.expect_column_values_to_not_be_null(
    "optional_field",
    mostly=0.95  # At least 95% non-null
)
# Unique values
validator.expect_column_values_to_be_unique("id")
# Unique together
validator.expect_compound_columns_to_be_unique(["first_name", "last_name"])
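The `mostly` argument turns an all-or-nothing check into a threshold on the fraction of passing rows. The sketch below illustrates that idea for the not-null case in plain Python; `passes_not_null` is a hypothetical helper, the sample columns are invented, and treating an empty column as passing is an assumption rather than documented library behavior.

```python
def passes_not_null(values, mostly=1.0):
    """Return True if the fraction of non-null values is at least `mostly`.

    Illustrative helper: None stands in for a null value.
    """
    total = len(values)
    if total == 0:
        return True  # assumption: an empty column has nothing to violate
    non_null = sum(value is not None for value in values)
    return non_null / total >= mostly


# Hypothetical columns with 100 rows each.
few_nulls = ["x"] * 97 + [None] * 3     # 97% non-null
many_nulls = ["x"] * 90 + [None] * 10   # 90% non-null

print(passes_not_null(few_nulls, mostly=0.95))   # → True
print(passes_not_null(many_nulls, mostly=0.95))  # → False
print(passes_not_null(few_nulls))                # → False (default requires 100%)
```

A threshold like this is useful for optional fields where a small share of missing values is acceptable but a sudden jump signals an upstream problem.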
3. Value Ranges
# Between range
validator.expect_column_values_to_be_between(
    "age",
    min_value=0,
    max_value=120
)
# Lower bound only: the column minimum must be at least 0
validator.expect_column_min_to_be_between("price", min_value=0)
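The two expectations above work at different levels: one checks every row, the other checks a single aggregate (the column minimum). The plain-Python sketch below illustrates that distinction, assuming inclusive bounds; both helper names and the sample values are hypothetical.

```python
def values_outside_range(values, min_value=None, max_value=None):
    """Row-level check: return the values outside [min_value, max_value]."""
    def in_range(value):
        above = min_value is None or value >= min_value
        below = max_value is None or value <= max_value
        return above and below
    return [value for value in values if not in_range(value)]


def column_min_in_range(values, min_value=None, max_value=None):
    """Aggregate check: is the column minimum inside [min_value, max_value]?"""
    return not values_outside_range([min(values)], min_value, max_value)


ages = [25, 30, 135]          # invented data with one implausible age
prices = [9.99, 0.0, 24.5]    # invented data, all non-negative

print(values_outside_range(ages, min_value=0, max_value=120))  # → [135]
print(column_min_in_range(prices, min_value=0))                # → True
print(column_min_in_range(prices + [-1.0], min_value=0))       # → False
```

The row-level form reports which values failed, which helps when debugging; the aggregate form only says whether the bound holds for the column as a whole.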