{"id":5081,"library":"torchtext","title":"TorchText","description":"TorchText is a Python library providing text utilities, models, transforms, and datasets for PyTorch. Version 0.18.0, released in April 2024, is the final stable release: active development of new features has stopped, and the release focuses on compatibility with PyTorch 2.3.0 and subsequent patch releases.","status":"maintenance","version":"0.18.0","language":"en","source_language":"en","source_url":"https://github.com/pytorch/text","tags":["pytorch","nlp","text processing","deep learning","datasets","embeddings","data preprocessing"],"install":[{"cmd":"pip install torchtext","lang":"bash","label":"Latest stable release"}],"dependencies":[{"reason":"TorchText is built on PyTorch and requires a compatible version.","package":"torch","optional":false}],"imports":[{"note":"Modern approach for tokenization.","symbol":"get_tokenizer","correct":"from torchtext.data.utils import get_tokenizer"},{"note":"Modern approach for vocabulary creation.","symbol":"build_vocab_from_iterator","correct":"from torchtext.vocab import build_vocab_from_iterator"},{"note":"For directly instantiating a vocabulary object.","symbol":"Vocab","correct":"from torchtext.vocab import Vocab"},{"note":"Example of importing a built-in dataset.","symbol":"AG_NEWS","correct":"from torchtext.datasets import AG_NEWS"},{"note":"For common text-processing transformations.","symbol":"transforms","correct":"from torchtext import transforms"},{"note":"For pre-trained models like T5 or RoBERTa.","symbol":"models","correct":"from torchtext import models"},{"note":"The `Field` class (along with `Iterator` and `BucketIterator`) was part of the legacy API. It coupled tokenization, vocabulary, splitting, batching, and numericalization into a single 'black box' and was replaced by a more modular pipeline. The legacy API was moved to `torchtext.legacy` in 0.9.0 and removed entirely in 0.12.0, so the import shown here only works on 0.9.x through 0.11.x; on current releases, use the modern API instead.","wrong":"from torchtext.data import Field","symbol":"Field","correct":"from torchtext.legacy import data as legacy_data"}],"quickstart":{"code":"import torch\nfrom torchtext.datasets import AG_NEWS\nfrom torchtext.data.utils import get_tokenizer\nfrom torchtext.vocab import build_vocab_from_iterator\nfrom torch.utils.data import DataLoader\n\ndef yield_tokens(data_iter, tokenizer):\n    for _, text in data_iter:\n        yield tokenizer(text)\n\ndef collate_batch(batch, vocab, tokenizer):\n    label_list, text_list, offsets = [], [], [0]\n    for (_label, _text) in batch:\n        label_list.append(int(_label) - 1)\n        processed_text = torch.tensor(vocab(tokenizer(_text)), dtype=torch.int64)\n        text_list.append(processed_text)\n        offsets.append(processed_text.size(0))\n    label_list = torch.tensor(label_list, dtype=torch.int64)\n    offsets = torch.tensor(offsets[:-1]).cumsum(dim=0)\n    text_list = torch.cat(text_list)\n    return label_list, text_list, offsets\n\n# 1. Access the raw training data iterator\ntrain_iter = AG_NEWS(split='train')\n\n# 2. Prepare the data processing pipeline\ntokenizer = get_tokenizer('basic_english')\n\n# Build the vocabulary\nvocab = build_vocab_from_iterator(\n    yield_tokens(train_iter, tokenizer),\n    min_freq=1,\n    specials=['<unk>']\n)\nvocab.set_default_index(vocab['<unk>'])\n\n# Re-create the training iterator (it was consumed while building the\n# vocabulary) and create the test iterator\ntrain_iter = AG_NEWS(split='train')\ntest_iter = AG_NEWS(split='test')\n\n# Bind vocab and tokenizer into the collate function\ncurrent_collate_batch = lambda batch: collate_batch(batch, vocab, tokenizer)\n\n# 3. Batch the data with DataLoader\nBATCH_SIZE = 64\n\ntrain_dataloader = DataLoader(\n    list(train_iter),  # Convert to a list for map-style dataset behavior\n    batch_size=BATCH_SIZE,\n    shuffle=True,\n    collate_fn=current_collate_batch\n)\n\ntest_dataloader = DataLoader(\n    list(test_iter),  # Convert to a list for map-style dataset behavior\n    batch_size=BATCH_SIZE,\n    shuffle=False,\n    collate_fn=current_collate_batch\n)\n\n# Example usage:\nfor i, (labels, texts, offsets) in enumerate(train_dataloader):\n    if i == 0:\n        print(f\"Batch {i+1}:\")\n        print(f\"  Labels: {labels}\")\n        print(f\"  Texts (concatenated token IDs): {texts}\")\n        print(f\"  Offsets (start index of each text in 'texts'): {offsets}\")\n        break","lang":"python","description":"This quickstart demonstrates the modern TorchText API for text classification. It covers accessing a raw dataset, building a vocabulary with `get_tokenizer` and `build_vocab_from_iterator`, and using `torch.utils.data.DataLoader` with a custom `collate_fn` that batches and numericalizes the texts, concatenating them with offsets (the format expected by `nn.EmbeddingBag`) rather than padding."},"warnings":[{"fix":"Users should plan to transition to alternative NLP libraries for active development, or use TorchText 0.18.0 as a stable but no longer actively developed base. For ongoing development, consider PyTorch's native `torch.utils.data` components combined with custom text processing.","message":"TorchText development has stopped, and the 0.18 release is announced as the last stable release. No new features are anticipated, and the library is in maintenance mode.","severity":"breaking","affected_versions":"0.16.0 onwards (announced in 0.16.0, confirmed in 0.18.0)"},{"fix":"Replace `Field` with explicit steps: `get_tokenizer`, `build_vocab_from_iterator`, and a custom `collate_fn` for `torch.utils.data.DataLoader`. Legacy components remained available under `torchtext.legacy` from 0.9.0 until their removal in 0.12.0.","message":"The legacy `torchtext.data.Field` and `Iterator` API was replaced with a more modular approach to align with `torch.utils.data.DataLoader`. This change provides clearer, more flexible components for tokenization, vocabulary, and batching.","severity":"breaking","affected_versions":"0.9.0 onwards"},{"fix":"Always install `torchtext` after verifying compatibility with your `torch` version. Refer to the official compatibility matrix or release notes; for example, TorchText 0.18.0 is compatible with PyTorch 2.3.0.","message":"TorchText releases are tightly coupled with specific PyTorch versions. Installing an incompatible `torch` and `torchtext` pair can lead to installation failures or runtime errors.","severity":"gotcha","affected_versions":"All versions"},{"fix":"If a dataset fails to load, check TorchText's GitHub issues for updated URLs or alternative download methods. Users may need to manually download and process data, or consider using other dataset libraries.","message":"Some built-in datasets may have outdated or broken download URLs, making them inaccessible (e.g., Multi30k was reported broken in older release notes).","severity":"gotcha","affected_versions":"All versions, specific to certain datasets"}],"env_vars":null,"last_verified":"2026-04-12T00:00:00.000Z","next_check":"2026-07-11T00:00:00.000Z"}