Complete Great Expectations Tutorial: Data Quality Testing for ML Pipelines
Great Expectations is an open-source Python library for validating, documenting, and profiling data. It helps you maintain data quality by defining "expectations": assertions about your data that can be tested automatically.
Why Great Expectations?
Advantages of Great Expectations:
- Data validation: automated data quality tests
- Documentation: auto-generated Data Docs
- Profiling: automatic expectation generation
- Integration: works with pandas, Spark, and SQL
- CI/CD ready: fits into existing data pipelines
Use cases:
- Data pipeline testing
- ML data validation
- Data migration verification
- ETL quality assurance
- Data contract enforcement
Installation
# Basic installation
pip install great_expectations
# With specific backends
pip install "great_expectations[spark]"
pip install "great_expectations[sqlalchemy]"
# Verify the installation
great_expectations --version
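The same check can be made from Python, which is convenient inside a pipeline or CI job where the CLI may not be on the PATH. The following is a minimal sketch using only the standard library. The helper name `installed_gx_version` is hypothetical, and it assumes the package is installed under the distribution name `great_expectations`.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_gx_version(dist_name: str = "great_expectations") -> Optional[str]:
    """Return the installed version string, or None if the package is absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_gx_version()
    print(found if found else "great_expectations is not installed")
```

Returning `None` instead of raising lets a pipeline decide for itself whether a missing install is fatal.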
Quick Start
1. Initialize Project
# Create a new GX project
great_expectations init
This creates:
great_expectations/
├── checkpoints/
├── expectations/
├── plugins/
├── profilers/
├── uncommitted/
└── great_expectations.yml
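After initialization it can be useful to confirm that the project directory looks as expected, for example in a CI step. The sketch below checks for the entries listed in the tree above. The helper name `missing_gx_entries` is hypothetical, and the exact layout may differ between GX versions, so treat the list as an assumption to adjust.

```python
from pathlib import Path
from typing import List

# Entries taken from the tree listed above; adjust for your GX version.
EXPECTED_ENTRIES = [
    "checkpoints",
    "expectations",
    "plugins",
    "profilers",
    "uncommitted",
    "great_expectations.yml",
]


def missing_gx_entries(project_root: Path) -> List[str]:
    """List expected entries that are absent under project_root/great_expectations."""
    gx_dir = project_root / "great_expectations"
    return [name for name in EXPECTED_ENTRIES if not (gx_dir / name).exists()]


if __name__ == "__main__":
    missing = missing_gx_entries(Path.cwd())
    print("Missing entries:", missing if missing else "none")
```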
2. Basic Usage with Pandas
import great_expectations as gx
import pandas as pd
# Create the data context
context = gx.get_context()
# Sample data
df = pd.DataFrame({
    'id': [1, 2, 3, 4, 5],
    'name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'age': [25, 30, 35, 40, 45],
    'email': ['alice@example.com', 'bob@example.com', 'charlie@example.com',
              'david@example.com', 'eve@example.com'],
    'salary': [50000, 60000, 70000, 80000, 90000]
})
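# Create an expectation suite
suite = context.add_expectation_suite("my_suite")
# Register the DataFrame as a batch. This uses the fluent datasource API
# and assumes GX 0.17/0.18; the datasource and asset names are placeholders.
datasource = context.sources.add_pandas("my_pandas_datasource")
asset = datasource.add_dataframe_asset(name="my_dataframe_asset")
batch_request = asset.build_batch_request(dataframe=df)
# Get a validator bound to the batch and the suite
validator = context.get_validator(
    batch_request=batch_request,
    expectation_suite_name="my_suite"
)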
# Add expectations
validator.expect_column_to_exist("id")
validator.expect_column_values_to_be_unique("id")
validator.expect_column_values_to_not_be_null("name")
validator.expect_column_values_to_be_between("age", min_value=18, max_value=100)
validator.expect_column_values_to_match_regex("email", r"^[\w\.-]+@[\w\.-]+\.\w+$")
# Save the expectations
validator.save_expectation_suite(discard_failed_expectations=False)
# Validate
results = validator.validate()
print(f"Success: {results.success}")
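To make it concrete what the five expectations above assert, here is a plain-Python sketch of the same checks on the same sample rows. It does not use Great Expectations and is not how the library implements them. It only illustrates the conditions being tested, and the variable names are illustrative.

```python
import re

# Same sample rows as the DataFrame above, held as plain column lists.
data = {
    "id": [1, 2, 3, 4, 5],
    "name": ["Alice", "Bob", "Charlie", "David", "Eve"],
    "age": [25, 30, 35, 40, 45],
    "email": ["alice@example.com", "bob@example.com", "charlie@example.com",
              "david@example.com", "eve@example.com"],
}

EMAIL_RE = re.compile(r"^[\w\.-]+@[\w\.-]+\.\w+$")

# One boolean per expectation, in the same order as the validator calls.
checks = {
    "id column exists": "id" in data,
    "id values are unique": len(set(data["id"])) == len(data["id"]),
    "name has no nulls": all(v is not None for v in data["name"]),
    "age within [18, 100]": all(18 <= v <= 100 for v in data["age"]),
    "email matches regex": all(EMAIL_RE.match(v) is not None for v in data["email"]),
}

# The run succeeds only if every individual check succeeds.
overall_success = all(checks.values())

for label, ok in checks.items():
    print(f"{label}: {ok}")
print(f"Success: {overall_success}")
```

Every check passes for this sample, so the overall flag is `True`, mirroring the `results.success` value in the GX example.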
Expectations
1. Column Existence
# Column exists
validator.expect_column_to_exist("column_name")
# Columns match a set
validator.expect_table_columns_to_match_set(
    column_set=["id", "name", "email", "created_at"],
    exact_match=False  # Allow additional columns
)
# Column order
validator.expect_table_columns_to_match_ordered_list(
    column_list=["id", "name", "email", "created_at"]
)
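The difference between the set check with and without `exact_match`, and the ordered-list check, can be sketched in plain Python. This is an illustration of the semantics, not the library's code, and the function names are hypothetical. With `exact_match=False` the expected columns only need to be a subset of the actual ones.

```python
from typing import Iterable


def columns_match_set(actual: Iterable[str], expected: Iterable[str],
                      exact_match: bool = True) -> bool:
    """Set comparison: order is ignored; extras are allowed only if exact_match is False."""
    actual_set, expected_set = set(actual), set(expected)
    if exact_match:
        return actual_set == expected_set
    return expected_set <= actual_set


def columns_match_ordered_list(actual: Iterable[str], expected: Iterable[str]) -> bool:
    """Ordered comparison: same names in the same positions."""
    return list(actual) == list(expected)


if __name__ == "__main__":
    table_columns = ["id", "name", "email", "created_at", "updated_at"]
    wanted = ["id", "name", "email", "created_at"]
    print(columns_match_set(table_columns, wanted, exact_match=False))
    print(columns_match_set(table_columns, wanted, exact_match=True))
    print(columns_match_ordered_list(table_columns, wanted))
```

Use the set form when downstream code selects columns by name, and the ordered form when something depends on position, such as a CSV export.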
2. Null and Unique Values
# Not null
validator.expect_column_values_to_not_be_null("required_field")
# Allow some nulls
validator.expect_column_values_to_not_be_null(
    "optional_field",
    mostly=0.95  # At least 95% non-null
)
# Unique values
validator.expect_column_values_to_be_unique("id")
# Unique in combination
validator.expect_compound_columns_to_be_unique(["first_name", "last_name"])
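The `mostly` parameter and compound uniqueness are easier to reason about with a small worked example. The sketch below is plain Python, not the GX implementation. The function names are hypothetical, and treating an empty column as passing is an assumption made here for simplicity.

```python
import math
from typing import Any, Sequence, Tuple


def _is_null(value: Any) -> bool:
    """Treat None and float NaN as null."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def not_null_passes(values: Sequence[Any], mostly: float = 1.0) -> bool:
    """Pass when the fraction of non-null values is at least `mostly`."""
    if not values:
        return True  # assumption: an empty column has nothing to violate
    non_null = sum(1 for v in values if not _is_null(v))
    return non_null / len(values) >= mostly


def compound_unique(rows: Sequence[Tuple[Any, ...]]) -> bool:
    """Pass when no combination of column values appears more than once."""
    return len(set(rows)) == len(rows)


if __name__ == "__main__":
    column = ["x"] * 96 + [None] * 4  # 96% non-null
    print(not_null_passes(column, mostly=0.95))
    print(not_null_passes(column))  # default requires 100% non-null
    print(compound_unique([("Ann", "Lee"), ("Ann", "Kim"), ("Bo", "Lee")]))
```

Note that compound uniqueness allows repeats within a single column, as with the two rows sharing the first name "Ann"; only the full combination must be distinct.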
3. Value Ranges
# Between range
validator.expect_column_values_to_be_between(
    "age",
    min_value=0,
    max_value=120
)
# Lower bound only: the column minimum must be at least 0
validator.expect_column_min_to_be_between("price", min_value=0)
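The two range checks above differ in what they inspect: one tests every value, the other tests only the column's minimum. A plain-Python sketch of that difference follows. It is an illustration rather than the library's code. The function names are hypothetical, bounds are treated as inclusive, a bound of `None` means "unbounded", and the column is assumed to be non-empty and free of nulls.

```python
from typing import Optional, Sequence


def values_between(values: Sequence[float],
                   min_value: Optional[float] = None,
                   max_value: Optional[float] = None) -> bool:
    """Row-level check: every value must lie within the inclusive bounds."""
    for v in values:
        if min_value is not None and v < min_value:
            return False
        if max_value is not None and v > max_value:
            return False
    return True


def column_min_between(values: Sequence[float],
                       min_value: Optional[float] = None,
                       max_value: Optional[float] = None) -> bool:
    """Aggregate check: only the smallest value is compared against the bounds."""
    return values_between([min(values)], min_value, max_value)


if __name__ == "__main__":
    ages = [0, 45, 120]
    prices = [3.5, 0.0, 10.0]
    print(values_between(ages, min_value=0, max_value=120))
    print(column_min_between(prices, min_value=0))
```

With only a lower bound, both forms reject the same data; the aggregate form becomes distinct once an upper bound on the minimum is added.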