SeqIO
SeqIO is a Python library from Google for defining task-based datasets, preprocessing pipelines, and evaluation for sequence models. It integrates closely with T5 and Gin-config; pipelines are built on `tf.data`, and the resulting batches can feed TensorFlow, JAX, or PyTorch models, making it a flexible framework for machine learning research, particularly in NLP. The current version is 0.0.20, and the library is under active development with frequent minor releases.
Common errors
- ValueError: No dataset found for split 'validation' for task 'my_task'
  Cause: The specified split ('validation') was not listed in the `splits` argument when defining the `seqio.FunctionDataSource` or `seqio.TfdsDataSource` for the task.
  Fix: Ensure the `splits` list in your `seqio.FunctionDataSource` (or equivalent) includes all splits you intend to use, e.g., `splits=["train", "validation"]`.
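  As a sketch of the fix (the toy `dataset_fn` and its data are illustrative):

  ```python
  import tensorflow as tf
  import seqio

  def dataset_fn(split, shuffle_files=False, seed=None):
      # A toy in-memory source; a real dataset_fn would read files.
      data = {"train": ["a b", "c d"], "validation": ["e f"]}
      return tf.data.Dataset.from_tensor_slices({"text": data[split]})

  # Listing every split you intend to request prevents the
  # "No dataset found for split 'validation'" error.
  source = seqio.FunctionDataSource(
      dataset_fn=dataset_fn,
      splits=["train", "validation"],
  )
  ```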
- AttributeError: module 'tensorflow' has no attribute 'lookup'
  Cause: This often indicates a version mismatch between `tensorflow` and `seqio`/`t5` dependencies. Specific `tf.lookup` functions or modules might have moved or been deprecated.
  Fix: Upgrade `tensorflow` to a compatible version (usually `>=2.9.0` for `seqio`) and ensure `t5` (if used) is also up-to-date. Check `seqio`'s `setup.py` for exact `tensorflow` requirements.
- gin.config.exceptions.InvalidConfigError: Unreachable configurable 'MyClass' for value
  Cause: This error from `gin-config` means that a class or function expected to be configurable by Gin was not decorated with `@gin.configurable` or imported correctly before `gin.parse_config_file` or `gin.enter_interactive_mode` was called.
  Fix: Verify that all classes and functions intended for Gin configuration are decorated with `@gin.configurable`. Ensure the Python module containing these configurables is imported before any Gin configuration takes place.
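  A minimal sketch of the pattern (the class name `MyModel` is hypothetical):

  ```python
  import gin

  @gin.configurable  # without this decorator, Gin cannot reach the class
  class MyModel:
      def __init__(self, hidden_size=8):
          self.hidden_size = hidden_size

  # The module defining MyModel must be imported before parsing config.
  gin.parse_config("MyModel.hidden_size = 32")
  model = MyModel()  # hidden_size is now bound to 32 by Gin
  ```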
- TypeError: 'tf.Tensor' object is not iterable
  Cause: This typically occurs in a preprocessor function when attempting to iterate directly over a `tf.Tensor`, or when Python string operations are applied to a `tf.Tensor`. Inside a `tf.data` pipeline, tensors are symbolic (graph mode), so eager-only methods like `.numpy()` are unavailable.
  Fix: Use graph-compatible `tf.strings` ops (e.g., `tf.strings.unicode_decode`, `tf.strings.split`) for string manipulation inside preprocessors, or wrap eager-only logic in `tf.py_function`.
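  A sketch of a graph-safe string preprocessor using `tf.strings` ops instead of `.numpy()`:

  ```python
  import tensorflow as tf

  def to_char_ids(example):
      # unicode_decode runs in graph mode; .numpy().decode() would fail here.
      return {"inputs": tf.strings.unicode_decode(example["inputs"], "UTF-8")}

  ds = tf.data.Dataset.from_tensor_slices({"inputs": ["hi"]}).map(to_char_ids)
  ```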
Warnings
- breaking SeqIO is in version 0.0.x, indicating an unstable API. Breaking changes can occur frequently between minor versions. Always consult the GitHub releases for changes.
- gotcha Deep integration with `gin-config` can make initial setup complex. Many SeqIO components are `@gin.configurable`, requiring users to understand `gin` configuration patterns.
- gotcha SeqIO requires a specific backend (TensorFlow, JAX, or PyTorch). A bare `pip install seqio` often results in missing functionality. You must install with `[tf]`, `[jax]`, or `[torch]` extras.
- gotcha SeqIO expects input data as `tf.data.Dataset` where each element is a dictionary of features, typically with string values for 'inputs' and 'targets' before tokenization.
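For illustration, a dataset in the shape SeqIO expects before tokenization:

```python
import tensorflow as tf

# Each element is a dict of features; 'inputs' and 'targets' are
# scalar string tensors prior to tokenization.
ds = tf.data.Dataset.from_tensor_slices({
    "inputs": ["translate: hello", "translate: world"],
    "targets": ["bonjour", "monde"],
})
first = next(iter(ds))
```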
Install
- pip install seqio
- pip install seqio[tf]
- pip install seqio[jax]
Imports
- seqio
import seqio
- Task
from seqio import Task
- Mixture
from seqio import Mixture
- FunctionDataSource
from seqio import FunctionDataSource
- Feature
from seqio import Feature
- Vocabulary
from seqio import Vocabulary
- preprocessors
from seqio import preprocessors
- get_mixture_or_task
from seqio import get_mixture_or_task
Quickstart
import seqio
import tensorflow as tf

# 1. Define a minimal vocabulary (required for seqio.Feature).
#    Token ids are Unicode codepoints; ids 0-2 are reserved for
#    pad/eos/unk.
class SimpleVocabulary(seqio.Vocabulary):
    @property
    def eos_id(self): return 1
    @property
    def unk_id(self): return 2
    @property
    def _base_vocab_size(self): return 256  # ASCII range
    def _encode(self, s): return [ord(c) for c in s]
    def _decode(self, ids): return "".join(chr(i) for i in ids if i > 2)
    def _encode_tf(self, s): return tf.strings.unicode_decode(s, "UTF-8")
    def _decode_tf(self, ids): return tf.strings.unicode_encode(ids, "UTF-8")

# 2. Define a data source function that returns a tf.data.Dataset
def my_data_source_fn(split, shuffle_files=False, seed=None):
    if split == "train":
        return tf.data.Dataset.from_tensor_slices({
            "inputs": ["hello world", "python is fun"],
            "targets": ["olleh dlrow", "nohtyp si nuf"],  # Simple reverse task
        })
    raise ValueError(f"Unknown split: {split}")

# 3. Define a simple preprocessor (string -> integer IDs). Note the
#    graph-compatible tf.strings ops: .numpy().decode() is unavailable
#    inside a tf.data pipeline.
@seqio.map_over_dataset
def tokenize_example(example):
    return {
        "inputs": tf.strings.unicode_decode(example["inputs"], "UTF-8"),
        "targets": tf.strings.unicode_decode(example["targets"], "UTF-8"),
    }

# 4. Register the task with SeqIO
vocab = SimpleVocabulary()
seqio.TaskRegistry.add(
    "simple_reverse_task",
    source=seqio.FunctionDataSource(
        dataset_fn=my_data_source_fn,
        splits=["train"],
    ),
    preprocessors=[
        tokenize_example,
        seqio.preprocessors.append_eos_after_trim,
    ],
    output_features={
        "inputs": seqio.Feature(vocabulary=vocab, add_eos=True),
        "targets": seqio.Feature(vocabulary=vocab, add_eos=True),
    },
)

# 5. Retrieve the task and get its processed dataset
task = seqio.get_mixture_or_task("simple_reverse_task")
ds = task.get_dataset(
    sequence_length={"inputs": 20, "targets": 20},  # Max length per feature
    split="train",
    shuffle=False,
)

# 6. Iterate through an example to verify
for ex in ds.take(1):
    print("\n--- Processed Example ---")
    print("Raw features:", {k: v.numpy() for k, v in ex.items()})
    decoded_inputs = task.output_features["inputs"].vocabulary.decode(ex["inputs"].numpy())
    decoded_targets = task.output_features["targets"].vocabulary.decode(ex["targets"].numpy())
    print(f"Decoded inputs: '{decoded_inputs}'")
    print(f"Decoded targets: '{decoded_targets}'")
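Downstream JAX or PyTorch trainers typically consume the processed `tf.data.Dataset` as NumPy batches; a generic sketch (independent of the task above, with illustrative data):

```python
import tensorflow as tf

ds = tf.data.Dataset.from_tensor_slices(
    {"inputs": [[1, 2], [3, 4]], "targets": [[5], [6]]}
)
for batch in ds.batch(2).as_numpy_iterator():
    # batch is a dict of NumPy arrays, ready for a framework-agnostic step.
    pass
```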