{"id":8633,"library":"seqio","title":"SeqIO","description":"SeqIO is a Python library by Google for creating task-based datasets, preprocessing pipelines, and evaluation for sequence models. It integrates deeply with T5, Gin-config, and TensorFlow/JAX/PyTorch backends, providing a flexible framework for machine learning research, particularly in NLP. The current version is 0.0.20, and it's under active development with frequent minor releases.","status":"active","version":"0.0.20","language":"en","source_language":"en","source_url":"https://github.com/google/seqio","tags":["nlp","sequence-models","tensorflow","jax","data-pipeline","t5","machine-learning","google"],"install":[{"cmd":"pip install seqio","lang":"bash","label":"Base install"},{"cmd":"pip install seqio[tf]","lang":"bash","label":"Install with TensorFlow backend"},{"cmd":"pip install seqio[jax]","lang":"bash","label":"Install with JAX backend"}],"dependencies":[{"reason":"Common backend for sequence models and data pipelines","package":"tensorflow","optional":true},{"reason":"Often used in conjunction with SeqIO for T5 model training","package":"t5","optional":true},{"reason":"Used for declarative configuration of tasks and models","package":"gin-config","optional":false},{"reason":"Google's Python Common Libraries, used internally","package":"absl-py","optional":false},{"reason":"Fundamental numerical computing library","package":"numpy","optional":false},{"reason":"Alternative backend for high-performance numerical computing","package":"jax","optional":true}],"imports":[{"symbol":"seqio","correct":"import seqio"},{"symbol":"Task","correct":"from seqio import Task"},{"symbol":"Mixture","correct":"from seqio import Mixture"},{"symbol":"FunctionDataSource","correct":"from seqio import FunctionDataSource"},{"symbol":"Feature","correct":"from seqio import Feature"},{"symbol":"Vocabulary","correct":"from seqio import Vocabulary"},{"symbol":"preprocessors","correct":"from seqio import 
preprocessors"},{"symbol":"get_mixture_or_task","correct":"from seqio import get_mixture_or_task"}],"quickstart":{"code":"import seqio\nimport tensorflow as tf\n\n# 1. Use SeqIO's built-in byte-level vocabulary (no vocabulary file required)\nvocabulary = seqio.ByteVocabulary()\n\n# 2. Define a data source function that returns a tf.data.Dataset\ndef my_data_source_fn(split, shuffle_files=False):\n    del shuffle_files  # Unused: the examples live in memory\n    if split == \"train\":\n        return tf.data.Dataset.from_tensor_slices({\n            \"inputs\": [\"hello world\", \"python is fun\"],\n            \"targets\": [\"olleh dlrow\", \"nohtyp si nuf\"] # Each word reversed\n        })\n    raise ValueError(f\"Unknown split: {split}\")\n\n# 3. Define a simple preprocessor using TF string ops\n# (preprocessors run inside tf.data in graph mode, so .numpy() is unavailable)\n@seqio.map_over_dataset\ndef strip_whitespace(example):\n    return {k: tf.strings.strip(v) for k, v in example.items()}\n\n# 4. 
Register the task with SeqIO\nseqio.TaskRegistry.add(\n    \"simple_reverse_task\",\n    source=seqio.FunctionDataSource(\n        dataset_fn=my_data_source_fn,\n        splits=[\"train\"]\n    ),\n    preprocessors=[\n        strip_whitespace,\n        seqio.preprocessors.tokenize, # Strings to integer IDs\n        seqio.preprocessors.append_eos_after_trim, # Trim to sequence_length, then append EOS\n    ],\n    output_features={\n        \"inputs\": seqio.Feature(vocabulary=vocabulary, add_eos=True),\n        \"targets\": seqio.Feature(vocabulary=vocabulary, add_eos=True)\n    }\n)\n\n# 5. Retrieve the task and get its processed dataset\ntask = seqio.get_mixture_or_task(\"simple_reverse_task\")\nds = task.get_dataset(\n    sequence_length={\"inputs\": 20, \"targets\": 20}, # Max sequence length for features\n    split=\"train\",\n    shuffle=False\n)\n\n# 6. Iterate through an example to verify\nfor ex in ds.take(1):\n    print(\"\\n--- Processed Example ---\")\n    print(\"Raw features:\", {k: v.numpy() for k, v in ex.items()})\n\n    decoded_inputs = task.output_features[\"inputs\"].vocabulary.decode(ex[\"inputs\"].numpy().tolist())\n    decoded_targets = task.output_features[\"targets\"].vocabulary.decode(ex[\"targets\"].numpy().tolist())\n    print(f\"Decoded inputs: '{decoded_inputs}'\")\n    print(f\"Decoded targets: '{decoded_targets}'\")\n","lang":"python","description":"This quickstart demonstrates how to define a custom task in SeqIO, including a data source function, a custom preprocessor, built-in tokenization into integer IDs, and SeqIO's byte-level vocabulary. It registers the task and then retrieves a processed `tf.data.Dataset` for inspection. This setup forms the basis for training sequence models."},"warnings":[{"fix":"Pin your `seqio` version to a specific `0.0.x` release in `requirements.txt`. Review changelogs when upgrading.","message":"SeqIO is in version 0.0.x, indicating an unstable API. 
Breaking changes can occur between 0.0.x releases. Check the GitHub release notes before upgrading.","severity":"breaking","affected_versions":"0.0.0 - 0.0.20"},{"fix":"Familiarize yourself with `gin-config` basics. When troubleshooting, ensure all necessary components are either explicitly configured via `gin.bind_parameter` or covered by `gin.parse_config_file`.","message":"Deep integration with `gin-config` can make initial setup complex. Many SeqIO components are `@gin.configurable`, requiring users to understand `gin` configuration patterns.","severity":"gotcha","affected_versions":"All"},{"fix":"Install the framework you need alongside `seqio` and confirm that it imports. Before relying on an extra such as `[tf]`, `[jax]`, or `[torch]`, check that your `seqio` release defines it in `setup.py`; `pip` only prints a warning when a requested extra does not exist.","message":"SeqIO pipelines are built on `tf.data`, so TensorFlow must be importable even when the model itself runs on JAX or PyTorch. A bare `pip install seqio` may not pull in everything your training stack needs.","severity":"gotcha","affected_versions":"All"},{"fix":"Ensure your `DataSource` function returns a `tf.data.Dataset` of dictionaries with keys like 'inputs' and 'targets'. 
Preprocessors then convert these to integer IDs.","message":"SeqIO expects input data as `tf.data.Dataset` where each element is a dictionary of features, typically with string values for 'inputs' and 'targets' before tokenization.","severity":"gotcha","affected_versions":"All"}],"env_vars":null,"last_verified":"2026-04-16T00:00:00.000Z","next_check":"2026-07-15T00:00:00.000Z","problems":[{"fix":"Ensure the `splits` list in your `seqio.FunctionDataSource` (or equivalent) includes all splits you intend to use, e.g., `splits=[\"train\", \"validation\"]`.","cause":"The specified split ('validation') was not listed in the `splits` argument when defining the `seqio.FunctionDataSource` or `seqio.TfdsDataSource` for the task.","error":"ValueError: No dataset found for split 'validation' for task 'my_task'"},{"fix":"Upgrade `tensorflow` to a compatible version (usually `>=2.9.0` for `seqio`) and ensure `t5` (if used) is also up-to-date. Check `seqio`'s `setup.py` for exact `tensorflow` requirements.","cause":"This often indicates a version mismatch between `tensorflow` and `seqio`/`t5` dependencies. Specific `tf.lookup` functions or modules might have moved or been deprecated.","error":"AttributeError: module 'tensorflow' has no attribute 'lookup'"},{"fix":"Verify that all classes and functions intended for Gin configuration are decorated with `@gin.configurable`. 
Ensure the Python module containing these configurables is imported before any Gin configuration takes place.","cause":"This error from `gin-config` means that a class or function expected to be configurable by Gin was not decorated with `@gin.configurable` or imported correctly before `gin.parse_config_file` or `gin.enter_interactive_mode` was called.","error":"gin.config.exceptions.InvalidConfigError: Unreachable configurable 'MyClass' for value"},{"fix":"Inside preprocessors, which `tf.data` traces in graph mode, use TensorFlow ops such as `tf.strings.split` or `tf.strings.unicode_decode` instead of Python iteration. `.numpy()` is only available in eager execution, so Python-only logic (for example `tensor.numpy().decode('utf-8')`) must be wrapped in `tf.py_function`, with the result converted back to a `tf.Tensor` before returning.","cause":"This typically occurs in a preprocessor function when Python code iterates directly over a `tf.Tensor` (for example a scalar string, or a symbolic tensor inside `Dataset.map`), or applies Python string operations to a `tf.Tensor` instead of using TensorFlow string ops.","error":"TypeError: 'tf.Tensor' object is not iterable"}]}