Wordninja
Wordninja is a Python library that probabilistically splits concatenated words using unigram frequencies from English Wikipedia. It segments strings such as 'imateapot' into ['im', 'a', 'teapot']. The current version is 2.0.0, released in August 2019; the project favors stability and sees little new feature development.
Warnings
- gotcha `wordninja.split()` discards punctuation: it is removed from the input and never appears in the output. For example, 'hello.world' becomes ['hello', 'world'].
- gotcha The probabilistic model can sometimes produce 'over-aggressive' or incorrect splits for specific words or abbreviations, such as 'patreon' splitting into ['pat', 're', 'on'] or 'esg' into ['e', 's', 'g'].
- gotcha The library is primarily designed for English. While custom language models are supported, creating them requires a specific gzipped text file format (one word per line, decreasing order of probability).
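The custom model format from the last warning can be sketched as follows. This is a minimal illustration, not the library's own tooling: `write_word_list`, `toy_words`, and the temporary file path are hypothetical names introduced here, and the five-word vocabulary is a toy. The sketch assumes `wordninja.LanguageModel` accepts the path to such a file, and it only exercises the model if wordninja is installed.

```python
import gzip
import os
import tempfile


def write_word_list(words, path):
    """Write words to a gzipped text file, one word per line.

    The caller is expected to pass the words already sorted from
    most probable to least probable.
    """
    with gzip.open(path, "wt", encoding="utf-8") as f:
        for word in words:
            f.write(word + "\n")


# Hypothetical toy vocabulary, most frequent word first.
toy_words = ["the", "tea", "pot", "teapot", "kettle"]
path = os.path.join(tempfile.mkdtemp(), "toy_words.txt.gz")
write_word_list(toy_words, path)

try:
    import wordninja

    # Assumption: LanguageModel takes the path to the gzipped word list.
    lm = wordninja.LanguageModel(path)
    print(lm.split("theteapot"))
except ImportError:
    print("wordninja is not installed; only the word list was written")
```

A real model would need a far larger vocabulary, since only the order of the words in the file conveys their relative probability.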
Install
pip install wordninja
Imports
- split
import wordninja
wordninja.split('text')
Quickstart
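import wordninja

# Split a run-together string into its most probable words.
split_words = wordninja.split('thisisateststring')
print(split_words)

# Longer phrases are handled the same way.
split_phrase = wordninja.split('hellofromtheotherside')
print(split_phrase)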
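One possible workaround for the punctuation and over-aggressive-split warnings is a thin wrapper around the splitter. The sketch below is not part of wordninja: `split_keeping_punctuation`, its `protected` parameter, and the token regex are hypothetical names and choices made for illustration. It takes the splitter as an argument, keeps each punctuation character as its own token, and leaves any alphanumeric run found in `protected` unsplit.

```python
import re

# A "word" is a run of ASCII letters, digits, or apostrophes;
# "other" is any single non-space character outside that set.
_TOKEN_RE = re.compile(r"(?P<word>[A-Za-z0-9']+)|(?P<other>[^A-Za-z0-9'\s])")


def split_keeping_punctuation(text, splitter, protected=()):
    """Apply `splitter` to alphanumeric runs and keep punctuation tokens.

    Runs that match an entry in `protected` (case-insensitive) are
    returned whole instead of being passed to `splitter`.
    """
    protected_lower = {w.lower() for w in protected}
    out = []
    for match in _TOKEN_RE.finditer(text):
        word = match.group("word")
        if word is None:
            out.append(match.group("other"))  # punctuation kept as-is
        elif word.lower() in protected_lower:
            out.append(word)  # known word, do not split
        else:
            out.extend(splitter(word))
    return out


try:
    import wordninja

    print(split_keeping_punctuation("hello.world", wordninja.split))
    print(split_keeping_punctuation("patreon", wordninja.split,
                                    protected=["patreon"]))
except ImportError:
    pass  # wordninja not installed; the helper works with any splitter
```

Note that `protected` only applies to a whole alphanumeric run. A protected word embedded in a longer concatenated string is still handed to the splitter.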