Tokenizers: Fast State-of-the-Art Tokenizers
raw JSON → 0.22.2 · verified Tue May 12 · auth: no · python install: verified · quickstart: verified
Tokenizers is a Python library providing fast and versatile tokenization tools, optimized for both research and production environments. The current version is 0.22.2, released on January 5, 2026. The library is actively maintained with regular updates to enhance performance and add features.
pip install tokenizers

Common errors
error ModuleNotFoundError: No module named 'tokenizers' ↓
cause The 'tokenizers' Python package has not been installed in the current environment or the Python interpreter is not using the correct environment where it is installed.
fix
pip install tokenizers
error Failed building wheel for tokenizers ↓
cause The 'tokenizers' library contains Rust components that require a Rust compiler to be installed and accessible in your system's PATH during the pip installation process.
fix
Install Rust (e.g., using curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh) and ensure it's in your system's PATH before retrying pip install tokenizers.
error ValueError: Path /path/to/model is not a valid tokenizers model directory. ↓
cause The path provided to `Tokenizer.from_file()` or `Tokenizer.from_pretrained()` does not point to a valid `tokenizer.json` file or a directory containing a valid tokenizer model.
fix
Ensure the specified path points directly to a tokenizer.json file or a directory recognized as a valid model by the tokenizers library, or use a correct Hugging Face model identifier.
error TypeError: Expected a list of strings, received ... ↓
cause The `encode_batch` method (or `encode` when used with multiple inputs) of the `tokenizers.Tokenizer` object received an input that was not a list of strings, or the list contained non-string elements.
fix
Ensure the input to tokenizer.encode_batch() is a list of strings; for a single input, ensure tokenizer.encode() receives a single string. Example: tokenizer.encode_batch(['text1', 'text2']).
error AttributeError: 'Encoding' object has no attribute 'input_ids' ↓
cause When using the `tokenizers` library directly, the `encode()` method returns an `Encoding` object, which has different attribute names (e.g., `ids`, `attention_mask`, `type_ids`) compared to the `transformers.BatchEncoding` object.
fix
Access the token IDs via encoding.ids, the attention mask via encoding.attention_mask, and the token type IDs via encoding.type_ids. For example, input_ids = encoding.ids.

Warnings
breaking Python 3.13 compatibility issues during installation ↓
fix Use Python 3.12 or earlier for installation; Python 3.13 is not supported due to PyO3 compatibility issues.
gotcha Ensure correct import path to avoid ImportError ↓
fix Use 'from tokenizers import Tokenizer' to import the Tokenizer class.
Install compatibility verified last tested: 2026-05-12
python  os / libc      status wheel install import disk
3.9     alpine (musl)  - -    0.02s  83.8M
3.9     slim (glibc)   - -    0.02s  67M
3.10    alpine (musl)  - -    0.02s  85.0M
3.10    slim (glibc)   - -    0.01s  68M
3.11    alpine (musl)  - -    0.04s  90.3M
3.11    slim (glibc)   - -    0.03s  73M
3.12    alpine (musl)  - -    0.03s  81.4M
3.12    slim (glibc)   - -    0.04s  64M
3.13    alpine (musl)  - -    0.03s  81.0M
3.13    slim (glibc)   - -    0.03s  64M
Imports
- Tokenizer
from tokenizers import Tokenizer
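Besides loading a pretrained file, a tokenizer can be assembled in code, saved, and reloaded with Tokenizer.from_file — which is also a way to reproduce the "not a valid tokenizers model directory" error path with a known-good file. A minimal sketch using a WordLevel model; the vocabulary and the tokenizer.json file name here are illustrative:

```python
from tokenizers import Tokenizer
from tokenizers.models import WordLevel
from tokenizers.pre_tokenizers import Whitespace

# Build a tiny word-level tokenizer from an in-memory vocabulary.
vocab = {"[UNK]": 0, "hello": 1, "world": 2}
tokenizer = Tokenizer(WordLevel(vocab, unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()

# Save to a tokenizer.json file, then reload it from that exact path.
tokenizer.save("tokenizer.json")
reloaded = Tokenizer.from_file("tokenizer.json")

encoding = reloaded.encode("hello world")
print(encoding.ids)     # → [1, 2]
print(encoding.tokens)  # → ['hello', 'world']
```

Note that from_file expects the path of the JSON file itself, not its parent directory.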
Quickstart verified last tested: 2026-04-23
from tokenizers import Tokenizer
# Load a pretrained tokenizer
tokenizer = Tokenizer.from_pretrained('bert-base-uncased')
# Tokenize a text
output = tokenizer.encode('Hello, world!')
print(output.tokens)