Great Expectations
Great Expectations (GX) is an open-source Python library for data quality. It helps data teams validate, document, and profile their data to ensure quality and consistency throughout data pipelines. It allows users to define 'Expectations' (assertions about data), run validation tests, and generate human-readable data quality reports called 'Data Docs'. The library is actively maintained with frequent releases and supports Python versions 3.10 through 3.13, with experimental support for 3.14.
Warnings
- breaking Breaking changes were introduced in the V2-to-V3 API transition and again in the move from the 0.x (V0) API to V1, requiring significant updates to configuration files and code (e.g., `expectation_suite_name` became `name`, `evaluation_parameters` became `suite_parameters`, `ge_cloud_id` became `id`). Validation Operators were deprecated in V3 in favor of Checkpoints.
- gotcha Windows support for the open-source Python package (GX OSS) is limited. Users in Windows environments may encounter installation errors or performance issues.
- gotcha When validating data from SQL data sources, it can be challenging to retrieve specific row identifiers (e.g., primary keys or row numbers) for failed expectations directly in the validation results. This often requires switching to a Pandas-based execution engine to obtain more granular details.
- gotcha In complex data pipelines, particularly when integrating with orchestrators like Airflow, users have reported issues with Expectations executing multiple times or experiencing slow performance.
Install
-
pip install great_expectations
Imports
- gx
import great_expectations as gx
- get_context
context = gx.get_context()
Quickstart
import great_expectations as gx
import pandas as pd
# 1. Initialize a Data Context (or use an existing one)
# By default this returns an ephemeral, in-memory context; for a persistent
# project, use `gx.get_context(mode="file")`
context = gx.get_context()
# 2. Connect to data (using a Pandas DataFrame for simplicity)
# This example uses a publicly available CSV dataset
# In a real scenario, you'd load your own data, e.g., from a file, database, or API
df = pd.read_csv("https://raw.githubusercontent.com/great-expectations/great_expectations/develop/tests/test_sets/taxi_trips.csv")
# Add a Pandas Data Source, a DataFrame Data Asset, and a Batch Definition
data_source = context.data_sources.add_pandas("my_pandas_datasource")
data_asset = data_source.add_dataframe_asset(name="my_dataframe_asset")
batch_definition = data_asset.add_batch_definition_whole_dataframe("my_batch_definition")
# 3. Create an Expectation Suite and add Expectations
# (assertions about your data)
suite = context.suites.add(gx.ExpectationSuite(name="my_suite"))
suite.add_expectation(gx.expectations.ExpectColumnToExist(column="passenger_count"))
suite.add_expectation(
    gx.expectations.ExpectColumnValuesToBeBetween(
        column="passenger_count", min_value=1, max_value=6
    )
)
suite.add_expectation(
    gx.expectations.ExpectColumnValuesToNotBeNull(column="pickup_datetime")
)
# 4. Tie the Batch Definition and Suite together in a Validation Definition
validation_definition = context.validation_definitions.add(
    gx.ValidationDefinition(
        name="my_validation", data=batch_definition, suite=suite
    )
)
# 5. Run validation via a Checkpoint
checkpoint = context.checkpoints.add(
    gx.Checkpoint(name="my_checkpoint", validation_definitions=[validation_definition])
)
checkpoint_result = checkpoint.run(batch_parameters={"dataframe": df})
# 6. Review validation results (e.g., in Data Docs)
# To open Data Docs in your browser, uncomment the lines below after a successful run
# context.build_data_docs()
# context.open_data_docs()
print("Validation successful:", checkpoint_result.success)
if not checkpoint_result.success:
    print("Validation failed. Check Data Docs for details.")