Tokenizers: Fast State-of-the-Art Tokenizers
raw JSON → 0.22.2 · verified Tue May 12 · auth: no · python install: verified · quickstart: verified
Tokenizers is a Python library providing fast and versatile tokenization tools, optimized for both research and production environments. The current version is 0.22.2, released on January 5, 2026. The library is actively maintained with regular updates to enhance performance and add features.
pip install tokenizers

Common errors
error ModuleNotFoundError: No module named 'tokenizers' ↓
cause The 'tokenizers' Python package has not been installed in the current environment or the Python interpreter is not using the correct environment where it is installed.
fix
pip install tokenizers
error Failed building wheel for tokenizers ↓
cause The 'tokenizers' library contains Rust components that require a Rust compiler to be installed and accessible in your system's PATH during the pip installation process.
fix
Install Rust (e.g., using curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh) and ensure it's in your system's PATH before retrying pip install tokenizers.
error ValueError: Path /path/to/model is not a valid tokenizers model directory. ↓
cause The path provided to `Tokenizer.from_file()` or `Tokenizer.from_pretrained()` does not point to a valid `tokenizer.json` file or a directory containing a valid tokenizer model.
fix
Ensure the specified path points directly to a tokenizer.json file or a directory recognized as a valid model by the tokenizers library, or use a correct Hugging Face model identifier.
error TypeError: Expected a list of strings, received ... ↓
cause The `encode_batch` method (or `encode` when used with multiple inputs) of the `tokenizers.Tokenizer` object received an input that was not a list of strings, or the list contained non-string elements.
fix
Ensure the input to tokenizer.encode_batch() is a list of strings; for a single input, ensure tokenizer.encode() receives a single string. Example: tokenizer.encode_batch(['text1', 'text2']).
error AttributeError: 'Encoding' object has no attribute 'input_ids' ↓
cause When using the `tokenizers` library directly, the `encode()` method returns an `Encoding` object, which has different attribute names (e.g., `ids`, `attention_mask`, `type_ids`) compared to the `transformers.BatchEncoding` object.
fix
Access the token IDs via encoding.ids, the attention mask via encoding.attention_mask, and the token type IDs via encoding.type_ids. For example, input_ids = encoding.ids.

Warnings
breaking Python 3.13 compatibility issues during installation ↓
fix Use Python 3.12 or earlier for installation; Python 3.13 is not supported due to PyO3 compatibility issues.
gotcha Ensure correct import path to avoid ImportError ↓
fix Use 'from tokenizers import Tokenizer' to import the Tokenizer class.
Install compatibility verified last tested: 2026-05-12
python  os / libc      status wheel install import disk
3.9     alpine (musl)  - -    0.02s  83.8M
3.9     slim (glibc)   - -    0.02s  67M
3.10    alpine (musl)  - -    0.02s  85.0M
3.10    slim (glibc)   - -    0.01s  68M
3.11    alpine (musl)  - -    0.04s  90.3M
3.11    slim (glibc)   - -    0.03s  73M
3.12    alpine (musl)  - -    0.03s  81.4M
3.12    slim (glibc)   - -    0.04s  64M
3.13    alpine (musl)  - -    0.03s  81.0M
3.13    slim (glibc)   - -    0.03s  64M
Imports
- Tokenizer
from tokenizers import Tokenizer
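Besides loading a pretrained file, a tokenizer can be assembled in code, saved, and reloaded with Tokenizer.from_file — which is also a way to reproduce the "not a valid tokenizers model directory" error path with a known-good file. A minimal sketch using a WordLevel model; the vocabulary and the tokenizer.json file name here are illustrative:

```python
from tokenizers import Tokenizer
from tokenizers.models import WordLevel
from tokenizers.pre_tokenizers import Whitespace

# Build a tiny word-level tokenizer from an in-memory vocabulary.
vocab = {"[UNK]": 0, "hello": 1, "world": 2}
tokenizer = Tokenizer(WordLevel(vocab, unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()

# Save to a tokenizer.json file, then reload it from that exact path.
tokenizer.save("tokenizer.json")
reloaded = Tokenizer.from_file("tokenizer.json")

encoding = reloaded.encode("hello world")
print(encoding.ids)     # → [1, 2]
print(encoding.tokens)  # → ['hello', 'world']
```

Note that from_file expects the path of the JSON file itself, not its parent directory.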
Quickstart verified last tested: 2026-04-23
from tokenizers import Tokenizer
# Load a pretrained tokenizer
tokenizer = Tokenizer.from_pretrained('bert-base-uncased')
# Tokenize a text
output = tokenizer.encode('Hello, world!')
print(output.tokens)