BlingFire
BlingFire is a Python wrapper for a fast Finite State Machine (FSM) based Natural Language Processing (NLP) library developed by Microsoft. It is designed for high-performance text tokenization, multi-word expression matching, stemming, and lemmatization. It is known for its speed and often outperforms other libraries, such as Hugging Face Tokenizers and spaCy, in tokenization tasks. The library supports several tokenization algorithms: pattern-based, WordPiece, Unigram LM, and BPE. The current version is 0.1.8, and periodic releases add new features and models.
Warnings
- breaking Version 0.1.7 fixed the internal offset of the 'dummy prefix' (a special token sometimes added during tokenization) so that it is always -1. Code that relied on the earlier offset behavior may break after upgrading.
- gotcha While `text_to_words` and `text_to_sentences` use default internal models, advanced tokenization (e.g., BERT, GPT-2, BPE, Unigram LM) requires explicitly loading pre-trained model files (typically `.bin` files) using `load_model`. These model files must be downloaded separately from the BlingFire GitHub repository or other sources.
- gotcha The `IdsToText` API, introduced in v0.1.8 and exposed in the Python package as `ids_to_text`, converts token IDs back to text. It relies on the loaded model having an internal ID-to-word mapping (`m_hasI2w`). Not all BlingFire models provide this mapping, so calling it with an incompatible model leads to errors.
- gotcha Earlier versions of BlingFire and its default models were primarily optimized for languages using space as a main token delimiter, with limited or no support for East Asian languages (e.g., Chinese, Japanese, Korean, Thai). While newer specialized models (like XLM-R) have improved multilingual support, general-purpose tokenization might still have limitations for non-space-delimited scripts.
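The model-loading gotcha above can be sketched as follows. This is a minimal illustration, not the library's prescribed usage: `encode_with_model` is a hypothetical helper name, the model path passed to it is a placeholder for a `.bin` file you have obtained yourself, and `text_to_ids` and `free_model` are assumed to be available in the installed Python package alongside `load_model` (check them against your version).

```python
import os


def encode_with_model(model_path, text, max_len=128, unk_id=0):
    """Tokenize `text` into token IDs with an explicitly loaded model.

    `model_path` must point to a BlingFire model file that already exists
    on disk; this helper does not download anything.
    """
    # Fail early with a clear message instead of passing a bad path
    # to the native library.
    if not os.path.isfile(model_path):
        raise FileNotFoundError(f"BlingFire model file not found: {model_path}")

    # Imported lazily so the path check above runs even if the package
    # is not installed. These names are assumptions about the Python API.
    from blingfire import load_model, text_to_ids, free_model

    handle = load_model(model_path)
    try:
        # Returns the token IDs for `text`, limited to `max_len` entries.
        return text_to_ids(handle, text, max_len, unk_id)
    finally:
        # Release the native model handle whether or not encoding succeeded.
        free_model(handle)
```

Checking the path before loading keeps the failure mode explicit: a missing model file raises `FileNotFoundError` in Python rather than surfacing as a less obvious error from the native layer.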
Install
pip install blingfire
Imports
- text_to_words
from blingfire import text_to_words
- text_to_sentences
from blingfire import text_to_sentences
- load_model
from blingfire import load_model
- text_to_words_with_model
from blingfire import text_to_words_with_model
- ids_to_text
from blingfire import ids_to_text
Quickstart
from blingfire import text_to_words, text_to_sentences
text = "After reading this post, you will know: What natural language is. This is a test. How are you?"
sentences = text_to_sentences(text)
print("Sentences:")
# BlingFire returns a single string with sentences separated by newline
print(sentences.split('\n'))
words = text_to_words(text)
print("\nWords:")
# BlingFire returns a single string with words separated by space
print(words.split(' '))
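Because both functions return plain strings, a common next step is to turn them into nested lists, one list of tokens per sentence. The sketch below shows one way to do that. The helper names (`to_sentence_list`, `to_word_list`, `tokenize_document`) are hypothetical, and the splitter functions are passed in as arguments so the logic does not depend on the native library being present.

```python
def to_sentence_list(raw):
    """Split a newline-separated sentence string, dropping empty entries."""
    return [s for s in raw.split("\n") if s]


def to_word_list(raw):
    """Split a space-separated token string; split() also ignores repeated spaces."""
    return raw.split()


def tokenize_document(text, sentence_splitter, word_splitter):
    """Return a list of token lists, one per sentence.

    `sentence_splitter` is expected to return sentences joined by newlines
    and `word_splitter` tokens joined by spaces, matching the output format
    described for text_to_sentences and text_to_words.
    """
    sentences = to_sentence_list(sentence_splitter(text))
    return [to_word_list(word_splitter(sentence)) for sentence in sentences]
```

With BlingFire installed, the call would be `tokenize_document(text, text_to_sentences, text_to_words)`, using the two functions imported in the quickstart.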