SeqIO Nightly
SeqIO is a Google-developed Python library for building scalable data pipelines for sequence models, leveraging `tf.data.Dataset`. It simplifies task-based datasets, preprocessing, and evaluation, offering compatibility with frameworks like JAX or PyTorch via NumPy iterators. This `seqio-nightly` package provides the bleeding-edge, actively developed version of the library. It is a refactoring of the `t5.data` library.
Common errors
-
ValueError: mutable default <class 'seqio.vocabularies.PassThroughVocabulary'> for field vocabulary is not allowed: use default_factory
cause A mutable object (such as a `PassThroughVocabulary` instance) is used directly as the default value of a dataclass field. Python's `dataclasses` module rejects such defaults because a single instance would be shared by every object of the class; Python 3.11 and later apply this check to any unhashable default.
fix Instead of `vocabulary=PassThroughVocabulary()` as a field default, declare the field with `dataclasses.field(default_factory=...)` so that each instance gets a fresh vocabulary, or construct the vocabulary where the `seqio.Feature` is built inside the `output_features` dictionary, as shown in the quickstart.
-
TypeError: dataset_fn() got an unexpected keyword argument 'shuffle'
cause The `dataset_fn` passed to `FunctionDataSource` does not declare the parameters it is called with.
fix Give `dataset_fn` a signature that accepts every argument seqio passes, even those it ignores. In the seqio sources, `FunctionDataSource` requires the leading parameters to be named `split` and `shuffle_files`, and also passes `seed` when a seed is set, e.g. `def dataset_fn(split, shuffle_files, seed=None): ...`.
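The mutable-default error can be reproduced and fixed without seqio. The following is a minimal sketch using only the standard library; `ToyVocabulary` and `FeatureConfig` are hypothetical stand-ins invented for illustration and are not part of seqio.

```python
import dataclasses


class ToyVocabulary:
    """Hypothetical stand-in for a vocabulary object (not a seqio class)."""

    def __init__(self, size=0):
        self.size = size

    def __eq__(self, other):
        return isinstance(other, ToyVocabulary) and self.size == other.size


@dataclasses.dataclass
class FeatureConfig:
    # default_factory is called once per instance, so nothing is shared.
    vocabulary: ToyVocabulary = dataclasses.field(default_factory=ToyVocabulary)
    add_eos: bool = False


def bad_default_error():
    """Return the message raised when a list is used as a field default."""
    try:
        @dataclasses.dataclass
        class Broken:
            tokens: list = []  # Rejected by dataclasses at class creation.
    except ValueError as err:
        return str(err)
    return None


a = FeatureConfig()
b = FeatureConfig()
print(a.vocabulary is b.vocabulary)  # Each instance owns its vocabulary.
print(bad_default_error())
```

The same pattern applies to any dataclass that holds a vocabulary: pass the class (or a zero-argument callable) as `default_factory` rather than an already constructed instance.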
Warnings
- breaking As a nightly release, `seqio-nightly` is on the bleeding edge of development and may introduce frequent API changes or instabilities without deprecation periods. It is not recommended for production use.
- deprecated `seqio` is a refactor of the `t5.data` library. Users migrating from `t5.data` may encounter API differences.
- gotcha When defining `seqio.Feature` or other classes, mutable default arguments (e.g., lists, dictionaries, or vocabulary objects directly) can lead to unexpected shared state bugs.
- gotcha Using `FunctionDataSource` with a `dataset_fn` that incorrectly handles `shuffle` or positional arguments can lead to unexpected data behavior or errors.
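A signature mismatch in `dataset_fn` can be caught before the function is handed to `FunctionDataSource`. The sketch below uses only the standard library. The helper `has_leading_params` is hypothetical, and the required parameter names (`split`, `shuffle_files`, optional `seed`) are an assumption based on the seqio sources; verify them against the installed version. Plain lists stand in for `tf.data.Dataset` objects.

```python
import inspect


def has_leading_params(fn, required=("split", "shuffle_files")):
    """Return True if fn's first parameters are named exactly as required."""
    names = tuple(inspect.signature(fn).parameters)
    return names[: len(required)] == tuple(required)


def good_dataset_fn(split, shuffle_files, seed=None):
    # Accepts every argument, even though this toy source ignores two of them.
    del shuffle_files, seed
    return [{"text": f"{split} example"}]


def bad_dataset_fn(split):
    # Missing `shuffle_files`, so a keyword call with it would raise TypeError.
    return [{"text": f"{split} example"}]


print(has_leading_params(good_dataset_fn))
print(has_leading_params(bad_dataset_fn))
```

Running such a check in a unit test surfaces the problem at import time rather than when the first dataset is requested.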
Install
-
pip install seqio-nightly
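Because nightly builds change frequently, it can help to record which build is installed. This is a minimal sketch using the standard library's `importlib.metadata`; the helper name `installed_version` is hypothetical, and the distribution name is assumed to match the `pip install` name above.

```python
from importlib import metadata


def installed_version(dist_name="seqio-nightly"):
    """Return the installed version string, or None if it is not installed."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


version = installed_version()
print(version if version else "seqio-nightly is not installed")
```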
Imports
- Task
import seqio
from seqio import Task
- Mixture
import seqio
from seqio import Mixture
- FeatureConverter
import seqio
from seqio import FeatureConverter
- get_dataset
import seqio
seqio.get_dataset(...)
- FunctionDataSource
from seqio.dataset_providers import FunctionDataSource
- PassThroughVocabulary
from seqio.vocabularies import PassThroughVocabulary
- preprocessors
from seqio import preprocessors
Quickstart
import seqio
import tensorflow as tf
from seqio.dataset_providers import FunctionDataSource
from seqio.vocabularies import PassThroughVocabulary


def my_text_generator():
    yield {'text': 'hello world'}
    yield {'text': 'seqio example'}


def my_dataset_fn(split, shuffle_files, seed=None):
    # FunctionDataSource requires the leading parameters to be named
    # `split` and `shuffle_files`; `seed` is passed when a seed is set.
    del split, shuffle_files, seed  # Unused by this toy source.
    return tf.data.Dataset.from_generator(
        my_text_generator,
        output_signature={'text': tf.TensorSpec(shape=(), dtype=tf.string)},
    )


# Define a data source
my_data_source = FunctionDataSource(
    dataset_fn=my_dataset_fn,
    splits=["train"],
    caching_permitted=False,
)

# Register the task
seqio.TaskRegistry.add(
    "my_simple_task",
    source=my_data_source,
    preprocessors=[],  # No preprocessing for simplicity
    output_features={
        # The feature is a scalar string, so override the default dtype
        # (tf.int32) and rank (1). PassThroughVocabulary requires a size.
        "text": seqio.Feature(
            vocabulary=PassThroughVocabulary(size=0),
            add_eos=False,
            dtype=tf.string,
            rank=0,
        )
    },
    metric_fns=[],
)

# Get the dataset from the registered task
task = seqio.get_mixture_or_task("my_simple_task")
dataset = task.get_dataset(
    sequence_length=None,  # Scalar strings have no length to trim
    split="train",
    shuffle=False,
    seed=0,
)

print("Up to 3 examples from the dataset:")
for i, example in enumerate(dataset.take(3)):
    print(f"Example {i}: {example['text'].numpy().decode('utf-8')}")