TensorFlow Data Validation
1.17.0 verified Mon Apr 27 auth: no python
TensorFlow Data Validation (TFDV) is a library for exploring and validating machine learning data. It computes descriptive statistics, infers a schema, detects anomalies, and supports data drift/skew detection. Current version is 1.17.0 (requires Python 3.9+), with releases following TensorFlow's cadence.
pip install tensorflow-data-validation
Common errors
error ModuleNotFoundError: No module named 'tfdv'
cause The library was imported as `tfdv`, but the installed module is named `tensorflow_data_validation`; `tfdv` is only the conventional alias.
fix Use `import tensorflow_data_validation as tfdv`.
error TypeError: generate_statistics_from_dataframe() got an unexpected keyword argument 'stats_options'
cause `stats_options` was passed as a keyword argument to an older TFDV release whose `generate_statistics_from_dataframe` signature does not accept it, or the call does not match the function's signature.
fix Use `tfdv.generate_statistics_from_dataframe(dataframe, stats_options=tfdv.StatsOptions(...))`. `StatsOptions` is exported at the top level of `tensorflow_data_validation`.
Warnings
breaking TFDV 1.0+ changed the API for `generate_statistics_from_csv` and `generate_statistics_from_dataframe`. The old `tfdv.generate_statistics` is deprecated.
fix Use `generate_statistics_from_csv` or `generate_statistics_from_dataframe` instead of `generate_statistics`.
deprecated `tfdv.visualize_statistics` is deprecated in favor of using `tfdv.utils.display_util.display_stats` for Jupyter visualization.
fix Use `from tensorflow_data_validation.utils.display_util import display_stats`.
gotcha TFDV statistics generation may be slow on large datasets; use Apache Beam for distributed processing.
fix Pass an Apache Beam `PipelineOptions` object through the `pipeline_options` argument of `generate_statistics_from_csv` so the job runs on a parallel or distributed runner. Apache Beam is installed as a TFDV dependency.
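A minimal sketch of the Beam route, assuming the `pipeline_options` parameter of `generate_statistics_from_csv` and Beam's DirectRunner flags `--direct_num_workers` and `--direct_running_mode`. The helper names `build_beam_args` and `csv_stats_with_beam` are hypothetical, as are the paths in the usage note.

```python
def build_beam_args(runner="DirectRunner", num_workers=4):
    """Return Beam command-line flags as a list of strings."""
    args = [f"--runner={runner}"]
    if runner == "DirectRunner":
        # DirectRunner runs locally; these flags spread work across processes.
        args += [
            f"--direct_num_workers={num_workers}",
            "--direct_running_mode=multi_processing",
        ]
    return args


def csv_stats_with_beam(data_location, output_path, beam_args):
    """Compute statistics for CSV files using the given Beam flags."""
    # Imported lazily so the flag helper works without TFDV installed.
    import tensorflow_data_validation as tfdv
    from apache_beam.options.pipeline_options import PipelineOptions

    return tfdv.generate_statistics_from_csv(
        data_location=data_location,
        output_path=output_path,
        pipeline_options=PipelineOptions(beam_args),
    )
```

A call might look like `csv_stats_with_beam("data/train.csv", "stats/train_stats.tfrecord", build_beam_args())`. A remote runner would need its own runner-specific flags in place of the DirectRunner ones.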
gotcha Schema inference from small samples may produce overly strict constraints; use `tfdv.update_schema` or manual tuning.
fix Call `tfdv.update_schema(schema, statistics)` with statistics computed over the full dataset to relax the schema, or adjust individual features by hand.
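Manual tuning can be sketched as follows, assuming the schema was produced by `tfdv.infer_schema` and that the feature being relaxed has a string domain. `tfdv.get_feature` and `tfdv.get_domain` are the schema accessors used here; the helper name `relax_string_feature` and its default threshold are hypothetical.

```python
def relax_string_feature(schema, feature_name, min_domain_mass=0.9, extra_values=()):
    """Loosen the constraints on one string feature of an inferred schema."""
    # Imported lazily so the module can be loaded without TFDV installed.
    import tensorflow_data_validation as tfdv

    # Accept the feature as long as this fraction of its values is in the domain.
    feature = tfdv.get_feature(schema, feature_name)
    feature.distribution_constraints.min_domain_mass = min_domain_mass

    # Add values known to be valid that were absent from the small sample.
    domain = tfdv.get_domain(schema, feature_name)
    for value in extra_values:
        if value not in domain.value:
            domain.value.append(value)
    return schema
```

For example, `relax_string_feature(schema, "feature2", extra_values=["x", "y", "z"])` would widen the domain of a string column before the schema is used for validation.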
Imports
- tensorflow_data_validation
  wrong: import tfdv
  correct: import tensorflow_data_validation as tfdv
- StatsOptions
  from tensorflow_data_validation import StatsOptions
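As an illustration of how the two imports fit together, here is a short sketch that assumes `StatsOptions` is reachable as `tfdv.StatsOptions` and that `num_top_values` and `num_histogram_buckets` are among its constructor arguments. The helper name `stats_with_options` and the chosen defaults are hypothetical.

```python
def stats_with_options(dataframe, num_top_values=10, num_histogram_buckets=5):
    """Generate statistics for a pandas DataFrame with custom StatsOptions."""
    # Imported lazily so the module can be loaded without TFDV installed.
    import tensorflow_data_validation as tfdv

    options = tfdv.StatsOptions(
        num_top_values=num_top_values,                # top-k values kept per string feature
        num_histogram_buckets=num_histogram_buckets,  # buckets in numeric histograms
    )
    # stats_options is passed by keyword, matching the fix described above.
    return tfdv.generate_statistics_from_dataframe(dataframe, stats_options=options)
```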
Quickstart
import tensorflow_data_validation as tfdv
import pandas as pd

data = pd.DataFrame({
    'feature1': [1, 2, 3, 4, 5],
    'feature2': ['a', 'b', 'c', 'd', 'e']
})

# Generate statistics
stats = tfdv.generate_statistics_from_dataframe(data)

# Infer schema
schema = tfdv.infer_schema(stats)
print(schema)

# Validate new data
test_data = pd.DataFrame({
    'feature1': [1, 2, 6],
    'feature2': ['x', 'y', 'z']
})
anomalies = tfdv.validate_statistics(
    tfdv.generate_statistics_from_dataframe(test_data),
    schema
)
print(anomalies)
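Printing the result dumps the whole protocol buffer, which can be hard to scan. One possible way to condense it, assuming the returned `Anomalies` message exposes an `anomaly_info` map whose entries carry a `short_description` field (the helper name `summarize_anomalies` is hypothetical):

```python
def summarize_anomalies(anomalies):
    """Map each anomalous feature name to its short description."""
    return {
        name: info.short_description
        for name, info in anomalies.anomaly_info.items()
    }
```

After the Quickstart, `for name, text in summarize_anomalies(anomalies).items(): print(name, text)` would list one line per flagged feature. In a notebook, `tfdv.display_anomalies(anomalies)` renders the same information as a table.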