TensorFlow Data Validation

1.17.0 verified Mon Apr 27 auth: no python

TensorFlow Data Validation (TFDV) is a library for exploring and validating machine learning data. It computes descriptive statistics, infers a schema, detects anomalies, and supports data drift/skew detection. Current version is 1.17.0 (requires Python 3.9+), with releases following TensorFlow's cadence.

pip install tensorflow-data-validation
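The name passed to pip (hyphens) differs from the name used in `import` (underscores). A minimal standard-library sketch for checking both from Python; `tfdv_install_status` is a hypothetical helper name used only for illustration, not part of TFDV:

```python
from importlib import metadata, util

DIST_NAME = "tensorflow-data-validation"    # name used with pip
MODULE_NAME = "tensorflow_data_validation"  # name used with import


def tfdv_install_status():
    """Return (installed version or None, whether the module can be found)."""
    try:
        version = metadata.version(DIST_NAME)
    except metadata.PackageNotFoundError:
        version = None
    # find_spec locates a top-level module without importing it.
    importable = util.find_spec(MODULE_NAME) is not None
    return version, importable


if __name__ == "__main__":
    version, importable = tfdv_install_status()
    print(f"{DIST_NAME}: version={version}, importable={importable}")
```

If the version is reported but the module is not importable, the interpreter running the check is probably not the one pip installed into.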
error ModuleNotFoundError: No module named 'tfdv'
cause The package is installed as `tensorflow-data-validation` and imported as `tensorflow_data_validation`; `tfdv` is only the conventional alias, not a module name.
fix
Use `import tensorflow_data_validation as tfdv`.
error TypeError: generate_statistics_from_dataframe() got an unexpected keyword argument 'stats_options'
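cause The function being called does not accept a `stats_options` keyword, which usually points to an outdated TFDV release or to another function shadowing the TFDV one.
fix
Upgrade TFDV and call `tfdv.generate_statistics_from_dataframe(dataframe, stats_options=tfdv.StatsOptions(...))`. `StatsOptions` is exported at the package top level and defined in `tensorflow_data_validation.statistics.stats_options`, not under `utils`.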
breaking TFDV 1.0+ changed the API for `generate_statistics_from_csv` and `generate_statistics_from_dataframe`. The old `tfdv.generate_statistics` is deprecated.
fix Use `generate_statistics_from_csv` or `generate_statistics_from_dataframe` instead of `generate_statistics`.
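gotcha `tfdv.visualize_statistics` renders its output only in a notebook environment such as Jupyter or Colab. It remains the public entry point in TFDV 1.x; `tensorflow_data_validation.utils.display_util` does not provide a `display_stats` function.
fix Call `tfdv.visualize_statistics(stats)` in a notebook, or use `get_statistics_html` from `tensorflow_data_validation.utils.display_util` to obtain the HTML outside one.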
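gotcha Statistics generation can be slow on large datasets: by default `generate_statistics_from_csv` and `generate_statistics_from_tfrecord` run on Apache Beam's local DirectRunner.
fix Apache Beam is installed as a TFDV dependency. Pass a Beam `PipelineOptions` object through the `pipeline_options` argument of `generate_statistics_from_csv` to select a distributed runner; the function has no `beam_pipeline_args` parameter.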
gotcha Schema inference from small samples may produce overly strict constraints, for example a string domain that lists only the values seen in the sample.
fix Relax the schema with `tfdv.update_schema(schema, statistics)` using statistics from the full dataset, or edit individual constraints by hand via `tfdv.get_feature` and `tfdv.get_domain`.

Compute statistics from a DataFrame, infer schema, and validate new data.

import tensorflow_data_validation as tfdv
import pandas as pd

data = pd.DataFrame({
    'feature1': [1, 2, 3, 4, 5],
    'feature2': ['a', 'b', 'c', 'd', 'e']
})

# Generate statistics
stats = tfdv.generate_statistics_from_dataframe(data)

# Infer schema
schema = tfdv.infer_schema(stats)
print(schema)

# Validate new data
test_data = pd.DataFrame({
    'feature1': [1, 2, 6],
    'feature2': ['x', 'y', 'z']
})
anomalies = tfdv.validate_statistics(
    tfdv.generate_statistics_from_dataframe(test_data),
    schema
)
print(anomalies)