{"id":3827,"library":"tensorflow-datasets","title":"TensorFlow Datasets","description":"TensorFlow Datasets (TFDS) is a library that provides a comprehensive collection of ready-to-use datasets for machine learning frameworks like TensorFlow, JAX, and PyTorch. It handles the complexities of downloading, preparing, and constructing data into `tf.data.Dataset` or `np.array` objects in a deterministic manner, enabling easy-to-use and high-performance input pipelines. The library maintains an active release cadence, with stable versions typically released every few months, alongside nightly builds.","status":"active","version":"4.9.9","language":"en","source_language":"en","source_url":"https://github.com/tensorflow/datasets","tags":["tensorflow","datasets","machine-learning","data-processing","jax","pytorch"],"install":[{"cmd":"pip install tensorflow-datasets","lang":"bash","label":"Stable release"},{"cmd":"pip install tfds-nightly","lang":"bash","label":"Nightly build"}],"dependencies":[{"reason":"Commonly used alongside TFDS, but not a strict dependency for reading datasets since v4.9.0. Required for any `tf.data.Dataset` operations.","package":"tensorflow","optional":true},{"reason":"Required for distributed dataset generation and certain large datasets. Version 4.9.9 pins it to `<2.65.0`.","package":"apache-beam","optional":true}],"imports":[{"symbol":"tfds","correct":"import tensorflow_datasets as tfds"},{"note":"Commonly imported when working with the TensorFlow backend, though TFDS can now read datasets without it.","symbol":"tf","correct":"import tensorflow as tf"}],"quickstart":{"code":"import tensorflow_datasets as tfds\nimport tensorflow as tf\n\n# Load the MNIST dataset\n# It will download and prepare the dataset if not already present.\n(ds_train, ds_test), info = tfds.load(\n    'mnist',\n    split=['train', 'test'],\n    shuffle_files=True,\n    as_supervised=True,  # Returns (image, label) tuples\n    with_info=True\n)\n\n# Build your input pipeline\nds_train = ds_train.shuffle(1000).batch(32).prefetch(tf.data.AUTOTUNE)\nds_test = ds_test.batch(32).prefetch(tf.data.AUTOTUNE)\n\n# Take one batch and print its shape and labels\nprint(f\"Dataset info: {info.name} version {info.version}\")\nfor images, labels in ds_train.take(1):\n    print(f\"Batch image shape: {images.shape}, labels: {labels.numpy()}\")","lang":"python","description":"This quickstart demonstrates how to load the MNIST dataset using `tfds.load()`, retrieve the training and testing splits, and configure a basic `tf.data.Dataset` input pipeline. It also shows how to inspect dataset metadata and print the shape and labels of one batch."},"warnings":[{"fix":"Review code that loads Hugging Face datasets and handle the new sentinel values explicitly, or map them back to `0` if the old behavior is desired.","message":"Starting with v4.9.3, the handling of `None` values for int and float features from `HuggingfaceDatasetBuilder` changed. Instead of being converted to `0` or `0.0`, `None` values are now converted to `np.iinfo(dtype).min` (int features) or `np.finfo(dtype).min` (float features). These sentinels are the minimum representable values of the dtype, so code relying on the previous `0` default can break.","severity":"breaking","affected_versions":">=4.9.3"},{"fix":"For non-TensorFlow users, ensure you only use TensorFlow-free access paths such as `tfds.data_source()` or the PyTorch/JAX-specific integrations. For TensorFlow users, ensure `tensorflow` is installed if using `tf.data` pipelines, even though TFDS does not strictly require it for basic dataset loading.","message":"Version 4.9.0 introduced native support for JAX and PyTorch, making TensorFlow an optional dependency for *reading* datasets. This enables a 'TensorFlow-less' path, but some functionality (e.g., `tf.data.Dataset` operations) still depends on TensorFlow, so codebases that aim to use TFDS without a full TensorFlow installation should be reviewed for such calls.","severity":"breaking","affected_versions":">=4.9.0"},{"fix":"Consider using a virtual environment to manage `apache-beam` versions specific to your `tensorflow-datasets` project, or explicitly downgrade `apache-beam` if conflicts arise.","message":"Version 4.9.9 pins the `apache-beam` dependency to `<2.65.0` due to internal test fixes. Users with newer versions of `apache-beam` installed globally or in other projects might encounter dependency conflicts or unexpected behavior during dataset generation, especially for large datasets that rely on Beam.","severity":"gotcha","affected_versions":"4.9.9"},{"fix":"If deterministic order is required, do not use the `--nondeterministic_order` flag. Implement explicit shuffling in your `tf.data` pipeline if randomized order is needed for training, rather than relying on generation-time shuffling.","message":"The `NoShuffleBeamWriter` introduced in v4.9.8, enabled by the `--nondeterministic_order` flag, significantly speeds up dataset generation by omitting shuffling. However, this explicitly removes deterministic order guarantees. If reproducible data order is critical for your experiments or debugging, avoid this flag or manually re-shuffle.","severity":"gotcha","affected_versions":">=4.9.8"},{"fix":"Consult the official `CroissantBuilder` documentation for the updated API and adjust your dataset generation scripts accordingly.","message":"The API for `CroissantBuilder` (used for generating TFDS datasets from Croissant metadata files) changed in v4.9.7. Code that uses this builder to create datasets will likely require updates.","severity":"deprecated","affected_versions":">=4.9.7"},{"fix":"Specify `split` when loading, or index the returned dictionary by split name. Use `as_supervised=True` for supervised learning tasks to get `(feature, label)` tuples, or `tfds.as_numpy()` to convert to NumPy arrays.","message":"Calling `tfds.load()` without a `split` argument returns a dictionary of `tf.data.Dataset` objects keyed by split name (e.g., `{'train': ..., 'test': ...}`), not a single dataset. Users who expect direct access to the data often forget to select a split (e.g., `split='train'`), to pass `as_supervised=True` for `(features, label)` tuples, or to call `tfds.as_numpy()` for NumPy arrays.","severity":"gotcha","affected_versions":"All versions"}],"env_vars":null,"last_verified":"2026-04-11T00:00:00.000Z","next_check":"2026-07-10T00:00:00.000Z"}