{"id":10287,"library":"tensorflow-transform","title":"TensorFlow Transform","description":"TensorFlow Transform (TFT) is a library for preprocessing data with TensorFlow. It allows users to define a preprocessing function that is applied to raw data *before* training, and then export this function as a TensorFlow graph that can be used for *inference*. This ensures consistency between training and serving. It's often used in conjunction with Apache Beam for distributed processing and is a key component of TensorFlow Extended (TFX). Releases track TensorFlow's release cadence; the current version is 1.17.0.","status":"active","version":"1.17.0","language":"en","source_language":"en","source_url":"https://github.com/tensorflow/transform","tags":["tensorflow","preprocessing","ml","data-transformation","apache-beam","tfx"],"install":[{"cmd":"pip install tensorflow-transform","lang":"bash","label":"Install latest version"}],"dependencies":[{"reason":"Core TensorFlow library required for graph definition and execution.","package":"tensorflow","optional":false},{"reason":"Used for distributed data processing when running transformations at scale.","package":"apache-beam","optional":false}],"imports":[{"symbol":"tensorflow_transform","correct":"import tensorflow_transform as tft"},{"symbol":"tensorflow_transform.tf_utils","correct":"import tensorflow_transform.tf_utils as tf_utils"},{"symbol":"tensorflow_transform.beam","correct":"import tensorflow_transform.beam as tft_beam"},{"symbol":"tensorflow_transform.beam.impl","correct":"from tensorflow_transform.beam import impl as beam_impl"},{"note":"The metadata module was moved under `tf_metadata` in later versions.","wrong":"from tensorflow_transform.metadata import dataset_metadata","symbol":"DatasetMetadata","correct":"from tensorflow_transform.tf_metadata import dataset_metadata"}],"quickstart":{"code":"import tempfile\n\nimport apache_beam as beam\nimport tensorflow as tf\nimport tensorflow_transform as tft\nimport 
tensorflow_transform.beam as tft_beam\nfrom tensorflow_transform.tf_metadata import dataset_metadata\nfrom tensorflow_transform.tf_metadata import schema_utils\n\n# 1. Define the schema of the raw data\n_RAW_DATA_FEATURE_SPEC = {\n    'x': tf.io.FixedLenFeature([], tf.float32),\n    'y': tf.io.FixedLenFeature([], tf.string),\n    's': tf.io.FixedLenFeature([], tf.float32, default_value=0.0)\n}\n_RAW_DATA_METADATA = dataset_metadata.DatasetMetadata(\n    schema_utils.schema_from_feature_spec(_RAW_DATA_FEATURE_SPEC))\n\n# 2. Define the preprocessing function\ndef preprocessing_fn(inputs):\n    \"\"\"Preprocesses raw inputs into transformed features.\"\"\"\n    outputs = {}\n    outputs['x_scaled'] = tft.scale_to_z_score(inputs['x'])\n    # Build a vocabulary over 'y' in the Analyze phase, then map each string\n    # to its integer index in the Transform phase.\n    y_indices = tft.compute_and_apply_vocabulary(\n        inputs['y'], vocab_filename='vocab_y')\n    # One-hot encode the indices; depth=3 assumes at most 3 unique values for y\n    outputs['y_one_hot'] = tf.one_hot(y_indices, depth=3)\n    outputs['s_identity'] = inputs['s']  # Pass through\n    return outputs\n\n# 3. Prepare some raw data\nraw_data = [\n    {'x': 10.0, 'y': 'apple', 's': 1.0},\n    {'x': 20.0, 'y': 'banana', 's': 2.0},\n    {'x': 30.0, 'y': 'apple', 's': 3.0},\n    {'x': 40.0, 'y': 'orange', 's': 4.0},\n    {'x': 50.0, 'y': 'banana', 's': 5.0},\n]\n\n# 4. 
Run the transform locally with Apache Beam's DirectRunner (the default runner).\n# A tft_beam.Context supplies the temporary directory TFT needs for\n# intermediate analyzer outputs.\nwith beam.Pipeline() as p, tft_beam.Context(temp_dir=tempfile.mkdtemp()):\n    # Create a PCollection of raw data\n    raw_data_pcollection = p | 'CreateRawData' >> beam.Create(raw_data)\n\n    # Apply the transform: Analyze (compute stats) and Transform (apply them).\n    # AnalyzeAndTransformDataset returns ((data, metadata), transform_fn).\n    (transformed_data, transformed_metadata), transform_fn = (\n        (raw_data_pcollection, _RAW_DATA_METADATA)\n        | 'AnalyzeAndTransform' >> tft_beam.AnalyzeAndTransformDataset(\n            preprocessing_fn)\n    )\n\n    # Print each transformed record as the pipeline runs\n    _ = transformed_data | 'PrintTransformedData' >> beam.Map(print)\n\nprint('Preprocessing complete.')","lang":"python","description":"This quickstart demonstrates how to use `tensorflow-transform` to preprocess a small dataset locally. It defines a `preprocessing_fn` that scales a numerical feature and one-hot encodes a categorical feature, then applies it using Apache Beam's DirectRunner."},"warnings":[{"fix":"Ensure your `preprocessing_fn` exclusively uses TensorFlow 2.x APIs. Avoid `tf.compat.v1` and ops that require a `tf.Session`.","message":"TensorFlow 2.x Compatibility: The `preprocessing_fn` passed to `AnalyzeAndTransformDataset` is traced into a TensorFlow graph, and for TFT versions >= 1.0, it runs in a TF2 context. Mixing TF1 `tf.compat.v1` APIs or session-based operations directly within `preprocessing_fn` can lead to errors.","severity":"breaking","affected_versions":">=1.0.0"},{"fix":"Design `preprocessing_fn` to be a pure TensorFlow graph definition. Use `tft` APIs for analyzers (e.g., `tft.scale_to_z_score`, `tft.compute_and_apply_vocabulary`) which handle the two-pass logic. Avoid stateful Python logic or non-TensorFlow operations that are not part of the `preprocessing_fn`'s graph construction.","message":"Two-Pass Transformation Model: TFT operates in two phases: 'Analyze' and 'Transform'. 
The 'Analyze' phase computes statistics (e.g., min/max for scaling, vocabulary for string-to-int) over the entire dataset. The 'Transform' phase then applies these computed statistics to individual data points. New users often misunderstand that `preprocessing_fn` is traced and executed as a graph, not a simple row-wise Python function.","severity":"gotcha","affected_versions":"*"},{"fix":"For production, familiarize yourself with Apache Beam's programming model and specific runner configurations. Test with a small subset of data on your chosen distributed runner before scaling up. Pay attention to serialization, data formats (e.g., TFRecord), and error handling in a distributed context.","message":"Apache Beam Integration for Scale: While TFT can run locally with Beam's `DirectRunner`, its primary use case is distributed processing with other Beam runners (e.g., Dataflow, Flink, Spark). Misconfiguring Beam runners, I/O connectors, or managing large datasets can be a source of errors and performance bottlenecks.","severity":"gotcha","affected_versions":"*"}],"env_vars":null,"last_verified":"2026-04-17T00:00:00.000Z","next_check":"2026-07-16T00:00:00.000Z","problems":[{"fix":"Ensure `tensorflow` and `tensorflow-transform` versions are compatible. Verify file paths are correct and accessible by the running user/service. For cloud storage, ensure appropriate authentication and permissions are set for the Beam runner.","cause":"This error often occurs when TensorFlow's `tf.io.gfile` (or `tf.compat.v2.io.gfile`) is used in a context where the underlying filesystem (e.g., local, GCS) is not properly initialized or accessible, or if the `tensorflow` version is mismatched with `tensorflow-transform` requirements.","error":"AttributeError: module 'tensorflow.compat.v2.io.gfile' has no attribute 'Exists'"},{"fix":"Use `tf.io.FixedLenFeature` with `default_value` when defining your `_RAW_DATA_FEATURE_SPEC` to handle missing values gracefully. 
Alternatively, use `tf.where` or `tf.cond` to handle `None` or empty tensors explicitly within your `preprocessing_fn`.","cause":"This typically happens inside `preprocessing_fn` if a feature is expected to be present but is missing in some input records, leading to a `None` value being passed to a TensorFlow operation that expects a `Tensor`.","error":"TypeError: unsupported operand type(s) for +: 'Tensor' and 'NoneType'"},{"fix":"Ensure the `AnalyzeAndTransformDataset` pipeline completes successfully, generating the `transform_fn` and the associated vocabulary. When serving, load the `transform_fn` correctly and ensure all assets (including vocabularies) are present and accessible in the exported `TransformGraph`.","cause":"This error often occurs when using `tft.compute_and_apply_vocabulary` or other vocabulary-based transformations, and the underlying lookup table (built from the vocabulary generated during the 'Analyze' phase) is not properly initialized before the 'Transform' phase or when attempting inference.","error":"tf.errors.FailedPreconditionError: Table not initialized."}]}