{"library":"seqio-nightly","title":"SeqIO Nightly","description":"SeqIO is a Google-developed Python library for building scalable data pipelines for sequence models, leveraging `tf.data.Dataset`. It simplifies task-based datasets, preprocessing, and evaluation, offering compatibility with frameworks like JAX or PyTorch via NumPy iterators. This `seqio-nightly` package provides the bleeding-edge, actively developed version of the library. It is a refactoring of the `t5.data` library.","language":"python","status":"active","last_verified":"Thu Apr 16","install":{"commands":["pip install seqio-nightly"],"cli":null},"imports":["import seqio\nfrom seqio import Task","import seqio\nfrom seqio import Mixture","import seqio\nfrom seqio import FeatureConverter","import seqio\nseqio.get_dataset(...)","from seqio.dataset_providers import FunctionDataSource","from seqio.vocabularies import PassThroughVocabulary","from seqio import preprocessors"],"auth":{"required":false,"env_vars":[]},"quickstart":{"code":"import seqio\nimport tensorflow as tf\nfrom seqio.dataset_providers import FunctionDataSource\nfrom seqio.vocabularies import PassThroughVocabulary\n\ndef my_text_generator():\n    yield {'text': 'hello world'}\n    yield {'text': 'seqio example'}\n\n# dataset_fn must accept `split` and `shuffle_files`; `seed` is optional.\ndef my_dataset_fn(split, shuffle_files, seed=None):\n    del split, shuffle_files, seed  # Unused in this toy example\n    return tf.data.Dataset.from_generator(\n        my_text_generator,\n        output_signature={\n            'text': tf.TensorSpec(shape=(), dtype=tf.string)\n        },\n    )\n\n# Define a data source\nmy_data_source = FunctionDataSource(\n    dataset_fn=my_dataset_fn,\n    splits=[\"train\"],\n    caching_permitted=False,\n)\n\n# Register the task in the global registry\nseqio.TaskRegistry.add(\n    \"my_simple_task\",\n    source=my_data_source,\n    preprocessors=[],  # No preprocessing for simplicity\n    output_features={\n        # Scalar string feature passed through untokenized;\n        # size=0 is a placeholder vocabulary size.\n        \"text\": seqio.Feature(\n            vocabulary=PassThroughVocabulary(size=0),\n            add_eos=False,\n            dtype=tf.string,\n            rank=0,\n        )\n    },\n    metric_fns=[],\n)\n\n# Look up the task and get its dataset\ntask = seqio.get_mixture_or_task(\"my_simple_task\")\ndataset = task.get_dataset(\n    sequence_length=None,  # Scalar string features are not trimmed\n    split=\"train\",\n    shuffle=False,\n    seed=0,\n)\n\nprint(\"Examples from the dataset:\")\nfor i, example in enumerate(dataset.take(3)):\n    print(f\"Example {i}: {example['text'].numpy().decode('utf-8')}\")","lang":"python","description":"This quickstart defines a simple SeqIO Task with a `FunctionDataSource` that yields two text examples and registers it with `seqio.TaskRegistry.add`. It then looks the task up with `seqio.get_mixture_or_task` and calls `Task.get_dataset` to retrieve and print the examples (`take(3)` returns only the two that exist). This demonstrates the basic steps of defining data sources, registering tasks, and obtaining a `tf.data.Dataset`.","tag":null,"tag_description":null,"last_tested":null,"results":[]},"compatibility":null}