{"id":4027,"library":"grain","title":"Grain (ML Data Library)","description":"Grain is a Python library from Google for efficiently loading and transforming data for machine learning model training and evaluation, particularly with JAX. It emphasizes flexibility, speed, and determinism in data processing pipelines. The library is actively developed, currently at version 0.2.16, with frequent releases that add features, fix bugs, and deprecate older APIs.","status":"active","version":"0.2.16","language":"en","source_language":"en","source_url":"https://github.com/google/grain","tags":["machine learning","data processing","JAX","data pipeline","ETL","data loading"],"install":[{"cmd":"pip install grain","lang":"bash","label":"Install stable release"}],"dependencies":[{"reason":"Common dependency for data manipulation.","package":"numpy","optional":false},{"reason":"Google's Python Abseil library, widely used in Google projects.","package":"absl-py","optional":false},{"reason":"Primary target framework, though Grain does not strictly require JAX to run.","package":"jax","optional":true},{"reason":"For reading data from the ArrayRecord format.","package":"array-record","optional":true},{"reason":"For reading data from Parquet files via ParquetIterDataset.","package":"pyarrow","optional":true},{"reason":"For integrating with TensorFlow Datasets.","package":"tensorflow-datasets","optional":true},{"reason":"For asynchronous checkpointing of data loading state.","package":"orbax-checkpoint","optional":true}],"imports":[{"symbol":"MapDataset","correct":"import grain\ndataset = grain.MapDataset.source([...])"},{"note":"MultiprocessPrefetchIterDataset and ConcatenateMapDataset were deprecated in 0.2.16; use `IterDataset.mp_prefetch` or `MapDataset.concatenate` instead.","wrong":"from grain.python.experimental import MultiprocessPrefetchIterDataset","symbol":"IterDataset","correct":"import grain\ndataset = grain.MapDataset.source([...])\niter_dataset = dataset.to_iter_dataset()"}],"quickstart":{"code":"import grain\n\ndataset = (\n    grain.MapDataset.source([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10])\n    .shuffle(seed=42)  # Shuffles elements globally.\n    .map(lambda x: x + 1)  # Maps each element.\n    .batch(batch_size=2)  # Batches consecutive elements.\n)\n\nprint(\"Processing dataset:\")\nfor batch in dataset:\n    print(batch)","lang":"python","description":"This example creates a simple `MapDataset` from a list, applies common transformations (shuffling, mapping, and batching), and iterates over the processed data. It showcases the declarative chaining API for building data pipelines."},"warnings":[{"fix":"Update `__getitem__` methods in custom `RandomAccessDataSource` implementations to accept an `int` argument: `def __getitem__(self, index: int):`","message":"Custom implementations of `RandomAccessDataSource` must now accept an `int` index in `__getitem__`. Legacy paths handling `SupportsIndex` still work at runtime, but type checkers may flag errors; switch to `int` for full compatibility.","severity":"breaking","affected_versions":"0.2.16 and later"},{"fix":"Upgrade your Python environment to 3.11 or newer.","message":"Support for Python 3.10 has been deprecated; the library now requires Python >=3.11.","severity":"deprecated","affected_versions":"0.2.14 and later"},{"fix":"Migrate usage from `grain.python.experimental.MultiprocessPrefetchIterDataset` to `grain.IterDataset.mp_prefetch` and from `grain.python.experimental.ConcatenateMapDataset` to `grain.MapDataset.concatenate`.","message":"The experimental APIs `grain.python.experimental.MultiprocessPrefetchIterDataset` and `grain.python.experimental.ConcatenateMapDataset` have been deprecated. Use their graduated versions, `grain.IterDataset.mp_prefetch` and `grain.MapDataset.concatenate`, respectively.","severity":"deprecated","affected_versions":"0.2.16 and later"},{"fix":"Define custom transformations at the top level of a module or as static methods/free functions, and avoid complex closures or unpicklable objects within them.","message":"When using Python multiprocessing for parallel data loading and transformation, all custom transformation functions (e.g., `MapTransform` subclasses) must be picklable. Non-picklable objects or closures cause serialization errors in worker processes.","severity":"gotcha","affected_versions":"All versions"},{"fix":"For random access or debugging, use `grain.MapDataset`. For performance-critical iteration during training, convert to `grain.IterDataset` using `dataset.to_iter_dataset()`.","message":"Choose between `MapDataset` and `IterDataset` based on access patterns. `MapDataset` supports efficient random access and suits debugging or order-dependent operations. `IterDataset` (often created via `MapDataset.to_iter_dataset()`) is designed for performant sequential iteration, typically in training loops, especially with prefetching.","severity":"gotcha","affected_versions":"All versions"}],"env_vars":null,"last_verified":"2026-04-11T00:00:00.000Z","next_check":"2026-07-10T00:00:00.000Z"}