{"id":5167,"library":"curated-tokenizers","title":"Curated Tokenizers","description":"Curated Tokenizers is a lightweight Python library by Explosion (the creators of spaCy) that provides efficient, production-ready implementations of several piece-tokenization algorithms, including Byte-Pair Encoding (BPE), WordPiece, and SentencePiece. It focuses on fast, reliable tokenization suitable for integration into larger NLP pipelines. The library is currently at version 2.0.0, with an active but infrequent release cadence focused on performance and stability.","status":"active","version":"2.0.0","language":"en","source_language":"en","source_url":"https://github.com/explosion/curated-tokenizers","tags":["tokenization","nlp","byte-pair-encoding","wordpiece","sentencepiece","explosion","spacy"],"install":[{"cmd":"pip install curated-tokenizers","lang":"bash","label":"Base installation (includes ByteBPE and WordPiece)"},{"cmd":"pip install curated-tokenizers[sentencepiece]","lang":"bash","label":"Installation with SentencePiece support"}],"dependencies":[{"reason":"Required for some internal tokenization logic; installed automatically as a dependency.","package":"regex","optional":false},{"reason":"Required to use the SentencePieceProcessor; must be installed as an optional extra.","package":"sentencepiece","optional":true}],"imports":[{"symbol":"ByteBPEProcessor","correct":"from curated_tokenizers import ByteBPEProcessor"},{"symbol":"WordPieceProcessor","correct":"from curated_tokenizers import WordPieceProcessor"},{"note":"All main processors are exposed directly at the top-level package for consistent access.","wrong":"from curated_tokenizers.tokenizers import SentencePieceProcessor","symbol":"SentencePieceProcessor","correct":"from curated_tokenizers import SentencePieceProcessor"}],"quickstart":{"code":"from curated_tokenizers import ByteBPEProcessor\n\n# Create a minimal, in-memory ByteBPE processor for demonstration.\n# In a real application, you would load pre-trained models from files\n# using methods like `ByteBPEProcessor.from_file(vocab_path, merges_path)`.\n\n# Define a simple vocabulary mapping tokens (as bytes) to IDs\ntoken_to_id = {\n    b\"<unk>\": 0, b\"a\": 1, b\"b\": 2, b\"c\": 3, b\"ab\": 4, b\"abc\": 5\n}\n# Reverse mapping from IDs to tokens\nid_to_token = {v: k for k, v in token_to_id.items()}\n\n# Define some merge rules (as tuples of bytes)\nmerges = [\n    (b\"a\", b\"b\"),\n    (b\"ab\", b\"c\")\n]\n\n# Instantiate the ByteBPEProcessor\nprocessor = ByteBPEProcessor(\n    token_to_id=token_to_id,\n    id_to_token=id_to_token,\n    bpe_merges=merges,\n    dropout=0.0,  # Use 0.0 for deterministic tokenization\n    unk_id=token_to_id[b\"<unk>\"]\n)\n\ntext = \"abc abc\"\nprint(f\"Original text: '{text}'\")\n\n# Encode the text into a list of integer IDs\nencoded_ids = processor.encode(text)\nprint(f\"Encoded IDs: {encoded_ids}\")\n\n# Decode the IDs back into a string\ndecoded_text = processor.decode_from_ids(encoded_ids)\nprint(f\"Decoded text: '{decoded_text}'\")","lang":"python","description":"This quickstart demonstrates how to instantiate and use a `ByteBPEProcessor` for encoding and decoding text. Note that while this example creates a processor in memory, typical usage involves loading pre-trained models from files using `from_file` methods (e.g., for `vocab.json` and `merges.txt`)."},"warnings":[{"fix":"Change `pip install cutlery` to `pip install curated-tokenizers` and update all `from cutlery import ...` statements to `from curated_tokenizers import ...`.","message":"The package was renamed from `cutlery` to `curated-tokenizers`. Users migrating from `cutlery` must update both their package name and their import statements.","severity":"breaking","affected_versions":"<0.0.7"},{"fix":"Install with `pip install curated-tokenizers[sentencepiece]` if you plan to use `SentencePieceProcessor`.","message":"The `SentencePieceProcessor` requires the `sentencepiece` library, which is an optional dependency and must be installed separately.","severity":"gotcha","affected_versions":"All"},{"fix":"For production or realistic examples, use `Processor.from_file(path_to_model)` and ensure you have the necessary model files.","message":"All piece processors (ByteBPEProcessor, WordPieceProcessor, SentencePieceProcessor) are designed to load pre-trained models from files. Instantiating them directly in memory for simple demos (as done in the quickstart) is possible but usually more involved than loading an existing model file.","severity":"gotcha","affected_versions":"All"},{"fix":"Thoroughly test existing tokenization logic after upgrading to v2.0.0 to ensure consistent output, especially for Byte BPE.","message":"Version 2.0.0 primarily introduces performance improvements for Byte BPE encoding. While the API is generally stable, major version bumps can involve subtle behavioral shifts; review your existing output for unexpected changes.","severity":"gotcha","affected_versions":">=2.0.0"}],"env_vars":null,"last_verified":"2026-04-13T00:00:00.000Z","next_check":"2026-07-12T00:00:00.000Z"}