{"id":9301,"library":"seqio-nightly","title":"SeqIO Nightly","description":"SeqIO is a Google-developed Python library for building scalable data pipelines for sequence models, leveraging `tf.data.Dataset`. It simplifies task-based datasets, preprocessing, and evaluation, offering compatibility with frameworks like JAX or PyTorch via NumPy iterators. This `seqio-nightly` package provides the bleeding-edge, actively developed version of the library. It is a refactoring of the `t5.data` library.","status":"active","version":"0.0.18.dev20250227","language":"en","source_language":"en","source_url":"https://github.com/google/seqio/tree/nightly","tags":["machine-learning","nlp","tensorflow","data-processing","sequence-models","jax","pytorch"],"install":[{"cmd":"pip install seqio-nightly","lang":"bash","label":"Install latest nightly"}],"dependencies":[{"reason":"Core data pipelines are built on `tf.data.Dataset`.","package":"tensorflow","optional":false},{"reason":"For converting `tf.data.Dataset` to NumPy iterators for JAX/PyTorch compatibility.","package":"numpy","optional":false}],"imports":[{"symbol":"Task","correct":"import seqio\nfrom seqio import Task"},{"symbol":"Mixture","correct":"import seqio\nfrom seqio import Mixture"},{"symbol":"FeatureConverter","correct":"import seqio\nfrom seqio import FeatureConverter"},{"symbol":"get_dataset","correct":"import seqio\nseqio.get_dataset(...)"},{"symbol":"FunctionDataSource","correct":"from seqio.dataset_providers import FunctionDataSource"},{"symbol":"PassThroughVocabulary","correct":"from seqio.vocabularies import PassThroughVocabulary"},{"symbol":"preprocessors","correct":"from seqio import preprocessors"}],"quickstart":{"code":"import seqio\nfrom seqio.dataset_providers import FunctionDataSource\nfrom seqio.vocabularies import PassThroughVocabulary\nimport tensorflow as tf\n\ndef my_text_generator():\n    yield {'text': 'hello world'}\n    yield {'text': 'seqio example'}\n\n# Define a data source\nmy_data_source = 
FunctionDataSource(\n    # SeqIO validates that dataset_fn's positional args are named (split, shuffle_files).\n    dataset_fn=lambda split, shuffle_files, seed=None: tf.data.Dataset.from_generator(\n        my_text_generator,\n        output_signature={'text': tf.TensorSpec(shape=(), dtype=tf.string)}\n    ),\n    splits=[\"train\"],\n    caching_permitted=False\n)\n\n# Register the task under a name\nseqio.TaskRegistry.add(\n    \"my_simple_task\",\n    source=my_data_source,\n    preprocessors=[],  # No preprocessing for simplicity\n    output_features={\n        # The feature is a raw scalar string, so declare dtype and rank explicitly\n        # (seqio.Feature defaults to dtype=tf.int32, rank=1).\n        \"text\": seqio.Feature(\n            vocabulary=PassThroughVocabulary(size=0),\n            add_eos=False,\n            dtype=tf.string,\n            rank=0\n        )\n    },\n    metric_fns=[]\n)\n\n# Look up the registered task and build its dataset\ntask = seqio.get_mixture_or_task(\"my_simple_task\")\ndataset = task.get_dataset(\n    sequence_length=None,  # Scalar string feature: nothing to trim\n    split=\"train\",\n    shuffle=False,\n    seed=0\n)\n\nprint(\"Examples from the dataset:\")\nfor i, example in enumerate(dataset.take(3)):\n    print(f\"Example {i}: {example['text'].numpy().decode('utf-8')}\")","lang":"python","description":"This quickstart registers a simple SeqIO Task backed by a `FunctionDataSource` that yields two text examples. It then looks the task up with `seqio.get_mixture_or_task` and calls `Task.get_dataset` to print the examples; `take(3)` returns only the two that are available. This demonstrates the basic steps of defining a data source, registering a task, and obtaining a `tf.data.Dataset`."},"warnings":[{"fix":"Refer to the latest GitHub repository for up-to-date API usage. Consider using the stable `seqio` release for more stability.","message":"As a nightly release, `seqio-nightly` is on the bleeding edge of development and may introduce frequent API changes or instabilities without deprecation periods. It is not recommended for production use.","severity":"breaking","affected_versions":"All nightly versions"},{"fix":"Consult the `seqio` documentation and migration guides on the GitHub repository to understand the updated API patterns for tasks, mixtures, and data sources.","message":"`seqio` is a refactor of the `t5.data` library. 
Users migrating from `t5.data` may encounter API differences.","severity":"deprecated","affected_versions":"All versions (migration from t5.data)"},{"fix":"In your own dataclasses, supply mutable defaults through `dataclasses.field(default_factory=lambda: ...)` so that each instance gets its own object, and construct vocabulary instances explicitly where they are used (e.g., `vocabulary=PassThroughVocabulary(size=0)` inside `output_features`).","message":"When defining dataclasses that hold `seqio.Feature` or vocabulary objects, mutable default values (e.g., lists, dictionaries, or vocabulary instances) are shared across instances and can cause unexpected shared-state bugs. Since Python 3.11, `dataclasses` rejects any unhashable default value with an error.","severity":"gotcha","affected_versions":"All versions"},{"fix":"Ensure the `dataset_fn` provided to `FunctionDataSource` takes `split` and `shuffle_files` as its positional arguments (optionally followed by `seed=None`), actually uses them, and returns a `tf.data.Dataset` with the expected output signature. Refer to examples for correct `output_signature` definition.","message":"Using `FunctionDataSource` with a `dataset_fn` whose positional arguments are not named `split` and `shuffle_files`, or that ignores them, can lead to errors or unexpected data behavior.","severity":"gotcha","affected_versions":"All versions"}],"env_vars":null,"last_verified":"2026-04-16T00:00:00.000Z","next_check":"2026-07-15T00:00:00.000Z","problems":[{"fix":"If the error comes from your own dataclass, replace the default instance with `dataclasses.field(default_factory=lambda: PassThroughVocabulary(size=0))` so that a fresh instance is created each time. If it is raised while importing `seqio` itself, the installed build likely predates Python 3.11's stricter check; upgrading `seqio-nightly` or running on an earlier Python version is the probable remedy.","cause":"An unhashable object (here an instance of `PassThroughVocabulary`) is used directly as the default value of a dataclass field. Python 3.11 and later reject such defaults because they would be shared across instances.","error":"ValueError: mutable default <class 'seqio.vocabularies.PassThroughVocabulary'> for field vocabulary is not allowed: use default_factory"},{"fix":"Define `dataset_fn` with the positional parameters `split` and `shuffle_files`, optionally followed by `seed=None`, e.g., `dataset_fn=lambda split, 
shuffle_files, seed=None: ...`. Accept `shuffle_files` even if your function does not use it.","cause":"The signature of the `dataset_fn` provided to `FunctionDataSource` does not declare the argument names it is called with; SeqIO expects `split` and `shuffle_files`.","error":"TypeError: dataset_fn() got an unexpected keyword argument 'shuffle'"}]}