SeqIO Nightly

0.0.18.dev20250227 · active · verified Thu Apr 16

SeqIO is a Google-developed Python library for building scalable data pipelines for sequence models on top of `tf.data.Dataset`. It simplifies defining task-based datasets, preprocessing, and evaluation, and its datasets can be consumed from frameworks such as JAX or PyTorch via NumPy iterators. SeqIO is a refactoring of the `t5.data` library; the `seqio-nightly` package provides its bleeding-edge, actively developed builds.

Common errors

Warnings

Install

Imports

Quickstart

This quickstart defines a simple SeqIO Task whose `FunctionDataSource` yields two text examples, then retrieves the task's `tf.data.Dataset` and prints the examples it contains (it asks for up to three, so both are shown). This demonstrates the basic steps of defining a data source, defining a task, and obtaining a dataset from it.

import seqio
from seqio.dataset_providers import FunctionDataSource
from seqio.vocabularies import PassThroughVocabulary
import tensorflow as tf

def my_text_generator():
    yield {'text': 'hello world'}
    yield {'text': 'seqio example'}

def my_dataset_fn(split, shuffle_files):
    # FunctionDataSource expects a callable taking `split` and `shuffle_files`.
    del split, shuffle_files  # Unused: the toy data has one fixed split.
    return tf.data.Dataset.from_generator(
        my_text_generator,
        output_signature={
            'text': tf.TensorSpec(shape=(), dtype=tf.string)})

# Define a data source
my_data_source = FunctionDataSource(
    dataset_fn=my_dataset_fn,
    splits=["train"],
    caching_permitted=False
)

# Register the task under a name so it can be looked up later
seqio.TaskRegistry.add(
    "my_simple_task",
    source=my_data_source,
    preprocessors=[],  # No preprocessing for simplicity
    output_features={
        # Raw strings pass through untokenized, so the feature is declared
        # as a scalar (rank 0) string instead of the default int32 sequence.
        "text": seqio.Feature(
            vocabulary=PassThroughVocabulary(size=0),
            add_eos=False, dtype=tf.string, rank=0)
    },
    metric_fns=[]
)

# Get the dataset from the registered task
task = seqio.get_mixture_or_task("my_simple_task")
dataset = task.get_dataset(
    sequence_length=None,  # Nothing to trim: "text" is a scalar string
    split="train",
    shuffle=False,
    seed=0
)

print("Examples from the dataset (at most 3):")
for i, example in enumerate(dataset.take(3)):
    print(f"Example {i}: {example['text'].numpy().decode('utf-8')}")
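The overview mentions that SeqIO datasets can feed JAX or PyTorch through NumPy iterators. A minimal sketch of that route follows. It is an illustration, not part of SeqIO's API: `decode_text_examples` and `demo` are hypothetical names, and the sketch assumes each example is a dictionary with a scalar string `text` feature, as in the quickstart. `as_numpy_iterator()` is the standard `tf.data.Dataset` method, and it returns scalar string tensors as `bytes`.

```python
from typing import Any, Iterable, Iterator, Mapping


def decode_text_examples(
    examples: Iterable[Mapping[str, Any]], key: str = "text"
) -> Iterator[str]:
    """Yield the `key` feature of each example as a Python string.

    Hypothetical helper: values are expected to be UTF-8 `bytes`, which is
    how `as_numpy_iterator()` hands back scalar string tensors.
    """
    for example in examples:
        value = example[key]
        # Pass through values that are already strings so the helper can be reused.
        yield value.decode("utf-8") if isinstance(value, bytes) else str(value)


def demo() -> None:
    """Build a small tf.data.Dataset and read it as plain Python values."""
    import tensorflow as tf  # Imported lazily; only this demo needs it.

    # Stand-in for a dataset obtained from a SeqIO task.
    dataset = tf.data.Dataset.from_tensor_slices(
        {"text": ["hello world", "seqio example"]}
    )
    for text in decode_text_examples(dataset.as_numpy_iterator()):
        print(text)
```

Calling `demo()` requires TensorFlow; the helper itself depends only on the standard library, so the same loop can consume `dataset.as_numpy_iterator()` for a dataset returned by a SeqIO task and pass the decoded values to a framework-specific tokenizer or batching step.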
