BlingFire

0.1.8 · active · verified Sat Apr 11

BlingFire is a Python wrapper for a fast finite-state-machine (FSM) based natural language processing (NLP) library developed by Microsoft. It is designed for high-performance text tokenization, multi-word expression matching, stemming, and lemmatization. It is often reported to tokenize faster than the tokenizers from Hugging Face and spaCy. The library supports several tokenization algorithms, including pattern-based, WordPiece, Unigram LM, and BPE. The current version is 0.1.8, and the project is actively maintained, with periodic releases that add features and models.

Warnings

Install

Imports

Quickstart

This quickstart demonstrates basic sentence splitting and word tokenization with BlingFire's default models, which require no explicit model loading. Each function returns a single string: text_to_sentences separates sentences with newlines, and text_to_words separates tokens with single spaces.

from blingfire import text_to_words, text_to_sentences

text = "After reading this post, you will know: What natural language is. This is a test. How are you?"

sentences = text_to_sentences(text)
print("Sentences:")
# BlingFire returns a single string with sentences separated by newline
print(sentences.split('\n'))

words = text_to_words(text)
print("\nWords:")
# BlingFire returns a single string with words separated by space
print(words.split(' '))
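
Because both functions return delimited strings, it can be convenient to wrap the splitting in small helpers. The sketch below is one possible way to do that, not part of BlingFire itself: the helper names split_sentences, split_words, and tokenize_by_sentence are hypothetical, and the sample string is hand-written in the same delimited shape rather than captured from the library. BlingFire is imported lazily inside tokenize_by_sentence so that the pure string helpers can be tried without the package installed.

```python
from typing import List


def split_sentences(raw: str) -> List[str]:
    """Split newline-delimited sentence output into a list, dropping empty entries."""
    return [line for line in raw.split("\n") if line]


def split_words(raw: str) -> List[str]:
    """Split whitespace-delimited token output into a list.

    str.split() with no argument collapses runs of whitespace and never
    yields empty strings, unlike str.split(' ').
    """
    return raw.split()


def tokenize_by_sentence(text: str) -> List[List[str]]:
    """Return one list of tokens per sentence (hypothetical convenience wrapper)."""
    # Lazy import: only needed when real tokenization is requested.
    from blingfire import text_to_sentences, text_to_words

    return [
        split_words(text_to_words(sentence))
        for sentence in split_sentences(text_to_sentences(text))
    ]


# Hand-written sample in the delimited shape described above (not real library output).
sample_sentences = "This is a test .\nHow are you ?"
nested = [split_words(s) for s in split_sentences(sample_sentences)]
print(nested)
```

The helpers differ from the plain split(' ') call in one respect: an empty input string yields an empty list rather than a list containing one empty string.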
