{"id":3910,"library":"blingfire","title":"BlingFire","description":"BlingFire is a Python wrapper for a fast natural language processing (NLP) library developed by Microsoft and built on finite state machines (FSMs). It is designed for high-performance text tokenization, multi-word expression matching, stemming, and lemmatization. Known for its speed, it is reported to outperform other NLP libraries such as Hugging Face Tokenizers and spaCy in tokenization tasks. The library supports several tokenization algorithms, including pattern-based, WordPiece, Unigram LM, and BPE. The current version is 0.1.8; releases are periodic and add new features and models.","status":"active","version":"0.1.8","language":"en","source_language":"en","source_url":"https://github.com/microsoft/blingfire/","tags":["NLP","tokenization","text processing","FSM","Microsoft","performance"],"install":[{"cmd":"pip install blingfire","lang":"bash","label":"Install latest stable version"}],"dependencies":[],"imports":[{"symbol":"text_to_words","correct":"from blingfire import text_to_words"},{"symbol":"text_to_sentences","correct":"from blingfire import text_to_sentences"},{"symbol":"load_model","correct":"from blingfire import load_model"},{"symbol":"text_to_words_with_model","correct":"from blingfire import text_to_words_with_model"},{"note":"Python binding for the native `IdsToText` API added in v0.1.8; converts token IDs back to text and requires a model that supports ID-to-word mapping.","symbol":"ids_to_text","correct":"from blingfire import ids_to_text"}],"quickstart":{"code":"from blingfire import text_to_words, text_to_sentences\n\ntext = \"After reading this post, you will know: What natural language is. This is a test. How are you?\"\n\nsentences = text_to_sentences(text)\nprint(\"Sentences:\")\n# text_to_sentences returns a single string with sentences separated by newlines\nprint(sentences.split('\\n'))\n\nwords = text_to_words(text)\nprint(\"\\nWords:\")\n# text_to_words returns a single string with words separated by spaces\nprint(words.split(' '))","lang":"python","description":"This quickstart demonstrates basic sentence splitting and word tokenization with BlingFire's default models, which do not require explicit model loading. Each function returns a single string, which can be split on newlines for sentences or on spaces for words."},"warnings":[{"fix":"Review code that processes token offsets, especially the first token's offset, when using models that may employ a dummy prefix.","message":"In version 0.1.7, the internal offset for the 'dummy prefix' (a special token sometimes added during tokenization) was fixed to always be -1. Code that relied on a different offset behavior may break after this change.","severity":"breaking","affected_versions":">=0.1.7"},{"fix":"Ensure the required `.bin` model files are available and load them explicitly with `handle = load_model('./path/to/your_model.bin')` before calling functions such as `text_to_words_with_model` or `ids_to_text`.","message":"While `text_to_words` and `text_to_sentences` use default internal models, advanced tokenization (e.g., BERT, GPT-2, BPE, Unigram LM) requires explicitly loading a pre-trained model file (typically a `.bin` file) with `load_model`. If a model file is not already available locally, it must be obtained from the BlingFire GitHub repository or other sources.","severity":"gotcha","affected_versions":"All"},{"fix":"Verify that the model used with `ids_to_text` supports ID-to-word conversion. Refer to the model's documentation or origin for compatibility.","message":"The `IdsToText` API, introduced in v0.1.8 and exposed in Python as `ids_to_text`, converts token IDs back to text. It relies on the loaded model having an internal ID-to-word mapping (`m_hasI2w`). Not all BlingFire models support this feature, so calling it with an incompatible model leads to errors.","severity":"gotcha","affected_versions":">=0.1.8"},{"fix":"For East Asian or other non-space-delimited languages, explicitly load and use a model trained for that language or script. Consult the BlingFire documentation for available multilingual models.","message":"Earlier versions of BlingFire and its default models were optimized primarily for languages that use spaces as the main token delimiter, with limited or no support for East Asian languages (e.g., Chinese, Japanese, Korean, Thai). Newer specialized models (such as XLM-R) have improved multilingual support, but general-purpose tokenization may still be limited for non-space-delimited scripts.","severity":"gotcha","affected_versions":"<0.1.5, All (for default models)"}],"env_vars":null,"last_verified":"2026-04-11T00:00:00.000Z","next_check":"2026-07-10T00:00:00.000Z"}