TensorFlow Metadata
TensorFlow Metadata (TFMD) provides standard representations for metadata that is useful when training machine learning models with TensorFlow. These include formats for describing tabular data schemas (e.g., `tf.Examples`), collections of summary statistics over datasets, and problem statements. It is a foundational library used by other TensorFlow Extended (TFX) components such as TensorFlow Data Validation (TFDV) and ML Metadata (MLMD). The library is actively maintained; the current release is version 1.17.3.
Warnings
- breaking: `tensorflow-metadata` often pins `protobuf` to specific major/minor versions, which frequently causes critical dependency conflicts with other libraries in the TensorFlow ecosystem.
- deprecated: Support for Python 3.8 was deprecated starting with version 1.15.0.
- gotcha: Nightly builds of `tensorflow-metadata` (and related TF projects) are explicitly described as unstable and prone to breakage; fixes can take a week or more.
- breaking: Version 1.15.0 changed how `min/max/avg/tot num-values` are calculated for nested features: the values now rely on the innermost level.
- breaking: The field `NaturalLanguageDomain.location_constraint_regex` was removed in version 1.15.0. It had been documented as "please do not use" and was never fully implemented.
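Because several of these warnings depend on which versions are installed, it can help to check the environment before debugging a conflict. The following is a minimal sketch using only the standard library; the helper names `installed_version` and `report_versions` are hypothetical and not part of `tensorflow-metadata`.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of `package`, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def report_versions(packages=("tensorflow-metadata", "protobuf")):
    """Map each distribution name to its installed version (or None)."""
    return {name: installed_version(name) for name in packages}


if __name__ == "__main__":
    for name, version in report_versions().items():
        print(f"{name}: {version or 'not installed'}")
```

Comparing the reported `protobuf` version against the requirement declared by the installed `tensorflow-metadata` release (for example via `pip show tensorflow-metadata`) is usually the quickest way to confirm a pin conflict.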
Install
pip install tensorflow-metadata
Imports
- Schema
from tensorflow_metadata.proto.v0 import schema_pb2
- Statistics
from tensorflow_metadata.proto.v0 import statistics_pb2
- MetadataStore (protos ship in the separate `ml-metadata` package, not in `tensorflow-metadata`)
from ml_metadata.proto import metadata_store_pb2
Quickstart
from tensorflow_metadata.proto.v0 import schema_pb2
# Create a simple schema definition
schema = schema_pb2.Schema()
# Add a feature named 'age' of type INT
feature_age = schema.feature.add()
feature_age.name = "age"
feature_age.type = schema_pb2.FeatureType.INT
feature_age.int_domain.is_categorical = False
feature_age.presence.min_fraction = 1.0 # 'age' must always be present
feature_age.int_domain.min = 0
feature_age.int_domain.max = 120
# Add a feature named 'city' of type BYTES (string), which is categorical
feature_city = schema.feature.add()
feature_city.name = "city"
feature_city.type = schema_pb2.FeatureType.BYTES
feature_city.string_domain.is_categorical = schema_pb2.StringDomain.CATEGORICAL_YES  # enum field on StringDomain, not a bool
feature_city.string_domain.value.extend(["New York", "London", "Tokyo"])
print("Generated Schema (protobuf format):")
print(schema)
# Serialize the schema to bytes
serialized_schema = schema.SerializeToString()
print(f"\nSerialized Schema (bytes): {len(serialized_schema)} bytes")
# Deserialize the schema back from bytes
deserialized_schema = schema_pb2.Schema()
deserialized_schema.ParseFromString(serialized_schema)
print("\nDeserialized Schema:")
print(deserialized_schema)