Snowball Stemmer Python Library
This package provides 32 stemmers for 30 languages, generated from the widely used Snowball algorithms. It is a pure Python implementation, commonly used in information retrieval and text processing pipelines to normalize words by reducing them to a stem. The current version is 3.0.1, and the library is actively maintained and lightweight.
Warnings
- gotcha Snowball stemmers are designed for information retrieval, not linguistic correctness. The generated 'stem' is often not a dictionary word or a true lemma. Expecting a grammatically correct root form is a common misconception.
- gotcha Applying the wrong language rules is a common mistake. Each stemmer is language-specific. Using an English stemmer on non-English text, or vice-versa, will yield incorrect results.
- gotcha Stemming can lead to over-stemming (stripping too much, grouping unrelated words) or under-stemming (not stripping enough, failing to group related words) due to its rule-based nature.
- gotcha A stemmer object is not thread-safe: calling the same instance from several threads at once can produce unexpected results. Create a separate stemmer object for each thread instead of sharing one.
- gotcha For performance-critical applications, the pure Python `snowballstemmer` can be slower than C-based implementations. Installing `PyStemmer` gives a significant speedup; when it is available, `snowballstemmer.stemmer()` uses it in place of the pure Python code.
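One way to respect the thread-safety warning above is to keep one stemmer per thread. The following is a minimal sketch, not part of the library: `get_stemmer` and `stem_batch` are hypothetical helper names, and it assumes that separate stemmer objects can be used safely in separate threads.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

import snowballstemmer

# Thread-local storage: each thread sees its own `stemmers` dict,
# so no stemmer object is ever shared between threads.
_local = threading.local()


def get_stemmer(language="english"):
    """Return the calling thread's stemmer for `language`, creating it on first use."""
    cache = getattr(_local, "stemmers", None)
    if cache is None:
        cache = _local.stemmers = {}
    if language not in cache:
        cache[language] = snowballstemmer.stemmer(language)
    return cache[language]


def stem_batch(words):
    """Stem a list of words with the current thread's own stemmer."""
    return get_stemmer().stemWords(words)


batches = [["running", "runs"], ["fields", "watching"], ["runners", "run"]]
with ThreadPoolExecutor(max_workers=3) as pool:
    # pool.map preserves the order of the input batches.
    results = list(pool.map(stem_batch, batches))

print(results)
```

Caching the stemmer per thread avoids rebuilding it for every batch while still keeping each instance confined to a single thread.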
Install
pip install snowballstemmer
Imports
- stemmer
import snowballstemmer
stemmer_obj = snowballstemmer.stemmer('english')
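Because each stemmer is language-specific, it can help to pick the stemmer explicitly from the language of the text and to fail loudly on an unknown name. The sketch below is illustrative only: `make_stemmer` is a hypothetical helper, and it assumes that `snowballstemmer.stemmer()` raises `KeyError` for a language it does not know.

```python
import snowballstemmer


def make_stemmer(language):
    """Build a stemmer for `language`, raising ValueError if it is not supported.

    Assumption: snowballstemmer.stemmer() signals an unknown language
    with KeyError, which is translated here into a clearer error.
    """
    try:
        return snowballstemmer.stemmer(language.lower())
    except KeyError as exc:
        raise ValueError(f"No Snowball stemmer for language {language!r}") from exc


# Keep one stemmer per language and choose by the language of the text.
stemmers = {name: make_stemmer(name) for name in ("english", "german")}

english_words = ["running", "runs"]
german_words = ["laufen", "gelaufen"]

print(stemmers["english"].stemWords(english_words))
print(stemmers["german"].stemWords(german_words))
```

Running German words through the English stemmer (or the reverse) would still return strings, which is why a mismatch is easy to miss without an explicit language choice like this.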
Quickstart
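import snowballstemmer
# Names of the available stemming algorithms (one per supported language).
algorithms = snowballstemmer.algorithms()
# Uncomment to list them:
# print(f"Available stemmers: {', '.join(algorithms)}")
# Create a stemmer for English; the language name selects the rule set.
stemmer = snowballstemmer.stemmer('english')
# Stem words one at a time with stemWord().
words = ['running', 'runs', 'ran', 'runner', 'unnecessary']
stems = [stemmer.stemWord(word) for word in words]
print(f"Words: {words}")
print(f"Stems: {stems}")
# Stem a whole list in one call with stemWords().
# Note: split() keeps punctuation attached (the last token is 'run.'),
# so strip punctuation first in real pipelines.
sentence_words = "We are running in the fields and watching runners run.".lower().split()
sentence_stems = stemmer.stemWords(sentence_words)
print(f"Sentence words: {sentence_words}")
print(f"Sentence stems: {sentence_stems}")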
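Since stems are meant for matching rather than for display, a common use is to group word forms that share a stem. The following sketch shows one way to do that and also makes over-stemming or under-stemming visible when unrelated words land in the same group or related words do not. `group_by_stem` is a hypothetical helper, not part of the library.

```python
from collections import defaultdict

import snowballstemmer


def group_by_stem(words, language="english"):
    """Map each stem to the list of input words that reduce to it."""
    stemmer = snowballstemmer.stemmer(language)
    groups = defaultdict(list)
    # stemWords() returns one stem per input word, in the same order.
    for word, stem in zip(words, stemmer.stemWords(words)):
        groups[stem].append(word)
    return dict(groups)


vocabulary = ["running", "runs", "run", "universe", "university", "universal"]
for stem, members in group_by_stem(vocabulary).items():
    print(f"{stem!r}: {members}")
```

Inspecting the printed groups for a sample of your own vocabulary is a quick way to judge whether the stemmer's grouping is acceptable for your retrieval task.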