TensorFlow Datasets
TensorFlow Datasets (TFDS) is a library that provides a comprehensive collection of ready-to-use datasets for machine learning frameworks like TensorFlow, JAX, and PyTorch. It handles the complexities of downloading, preparing, and constructing data into `tf.data.Dataset` or `np.array` objects in a deterministic manner, enabling easy-to-use and high-performance input pipelines. The library maintains an active release cadence, with stable versions typically released every few months, alongside daily nightly builds.
Warnings
- breaking Starting with v4.9.3, the handling of `None` values for int and float features from `HuggingfaceDatasetBuilder` changed. Instead of converting to `0` or `0.0`, `None` values are now converted to `np.iinfo(dtype).min` or `np.finfo(dtype).min` respectively. This change aligns with NumPy's default behavior for minimum values but can break code relying on the previous `0` default.
- breaking Version 4.9.0 introduced native support for JAX and PyTorch, making TensorFlow an optional dependency for *reading* datasets. This enables a 'TensorFlow-less' path, but some functionality (e.g., `tf.data.Dataset` operations) still requires TensorFlow, so codebases that mix TFDS with TensorFlow APIs should be reviewed before dropping the full TensorFlow installation.
- gotcha Version 4.9.9 pins the `apache-beam` dependency to `<2.65.0` due to internal test fixes. Users with newer versions of `apache-beam` installed globally or in other projects might encounter dependency conflicts or unexpected behavior during dataset generation, especially for large datasets that rely on Beam.
- gotcha The `NoShuffleBeamWriter` introduced in v4.9.8, enabled by the `--nondeterministic_order` flag, significantly speeds up dataset generation by omitting shuffling. However, this explicitly removes deterministic order guarantees. If reproducible data order is critical for your experiments or debugging, avoid this flag or manually re-shuffle.
- deprecated The API for `CroissantBuilder` (used for generating TFDS datasets from Croissant metadata files) underwent changes in v4.9.7. Code interacting with this specific builder for dataset creation will likely require updates.
- gotcha By default, `tfds.load()` without specifying the `split` argument returns a dictionary of `tf.data.Dataset` objects (e.g., `{'train': ..., 'test': ...}`). Users often expect direct access to data and might forget to select a split (e.g., `split='train'`) or use `as_supervised=True` for `(features, label)` tuples or `tfds.as_numpy()` for NumPy arrays.
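The sentinel values described in the first warning are just NumPy's type minima. A minimal sketch (assuming the default `int64` and `float32` feature dtypes) shows what a `None` becomes:

```python
import numpy as np

# Per v4.9.3, a None int/float feature from HuggingfaceDatasetBuilder is
# encoded as the dtype's minimum rather than 0 / 0.0. These are the
# sentinels to check for downstream:
int_sentinel = np.iinfo(np.int64).min      # -9223372036854775808
float_sentinel = np.finfo(np.float32).min  # most negative finite float32

print(int_sentinel)
print(float_sentinel)
```

When consuming a HuggingFace-backed dataset, compare against these sentinels rather than `0` to detect missing values.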
Install
- pip install tensorflow-datasets
- pip install tfds-nightly
Imports
- tfds
import tensorflow_datasets as tfds
- tf
import tensorflow as tf
Quickstart
import tensorflow_datasets as tfds
import tensorflow as tf
# Load the MNIST dataset
# It will download and prepare the dataset if not already present.
(ds_train, ds_test), info = tfds.load(
    'mnist',
    split=['train', 'test'],
    shuffle_files=True,
    as_supervised=True,  # Returns (image, label) tuples
    with_info=True,
)
# Build your input pipeline
ds_train = ds_train.shuffle(1000).batch(32).prefetch(tf.data.AUTOTUNE)
ds_test = ds_test.batch(32).prefetch(tf.data.AUTOTUNE)
# Iterate and print a sample
print(f"Dataset info: {info.name} version {info.version}")
for image, label in ds_train.take(1):
    print(f"Sample image shape: {image.shape}, label: {label.numpy()}")