Snowball Stemmer Python Library

version 3.0.1 | verified Tue May 12 | auth: no | python install: verified

This package provides 32 stemmers for 30 languages, generated from the widely-used Snowball algorithms. It is a pure Python implementation, often employed in information retrieval and text processing pipelines for word normalization. Currently at version 3.0.1, the library is actively maintained, providing a lightweight and fast solution for reducing words to their base forms.

pip install snowballstemmer
error ModuleNotFoundError: No module named 'snowballstemmer'
cause The `snowballstemmer` package is not installed in the current Python environment.
fix
pip install snowballstemmer
error AttributeError: 'Stemmer' object has no attribute 'stem'
cause Users often confuse the `stemWord` or `stemWords` methods of `snowballstemmer` with a `stem` method found in other stemming libraries like NLTK.
fix
Use `stemmer.stemWord('word')` for a single word or `stemmer.stemWords(['word1', 'word2'])` for a list of words.
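A minimal sketch of the two call shapes (method names per the `snowballstemmer` API; the sample words are illustrative):

```python
import snowballstemmer

stemmer = snowballstemmer.stemmer('english')

# Single word: stemWord takes one string and returns one string.
single = stemmer.stemWord('running')   # 'run'

# Batch: stemWords takes a list of strings and returns a list of stems.
batch = stemmer.stemWords(['running', 'runs'])

print(single)
print(batch)
```

There is no `stem` method on the object returned by `snowballstemmer.stemmer()`; that name belongs to other libraries such as NLTK.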
error TypeError: Stemmer.__init__() missing 1 required positional argument: 'stemmers'
cause The `Stemmer` class is being instantiated directly with a language name string, but its constructor expects an internal list of stemmer objects; the `snowballstemmer.stemmer()` factory function should be used instead.
fix
import snowballstemmer; my_stemmer = snowballstemmer.stemmer('english')
error TypeError: expected string, list found
cause The `stemWord` method is designed to process a single string argument, but it received a list of words.
fix
Use `stemmer.stemWords(['word1', 'word2'])` for processing a list of words, or iterate and call `stemWord` for each string.
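Both approaches produce the same stems; a short sketch comparing them (sample vocabulary is illustrative):

```python
import snowballstemmer

stemmer = snowballstemmer.stemmer('english')
words = ['connected', 'connecting', 'connection']

# Preferred: one batch call over the whole list.
batch = stemmer.stemWords(words)

# Equivalent: stem each string individually.
looped = [stemmer.stemWord(w) for w in words]

print(batch)
```

Passing the list itself to `stemWord` is what raises the `TypeError` above.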
gotcha Snowball stemmers are designed for information retrieval, not linguistic correctness. The generated 'stem' is often not a dictionary word or a true lemma. Expecting a grammatically correct root form is a common misconception.
fix Understand that the output is a base form for conflation, not necessarily a dictionary entry. If true lemmas are needed, consider a lemmatization library (e.g., NLTK with WordNet).
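A quick way to see this: many English stems are not dictionary words, yet they still conflate related forms. A small sketch (word choices are illustrative):

```python
import snowballstemmer

stemmer = snowballstemmer.stemmer('english')

# Snowball output is a conflation key, not a lemma: 'happi' and 'argu'
# are not English words, but they group related surface forms together.
for word in ['happiness', 'argues', 'argued', 'arguing']:
    print(word, '->', stemmer.stemWord(word))
```

If grammatically valid root forms matter for your application, a lemmatizer (e.g. NLTK's WordNet lemmatizer) is the better tool.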
gotcha Applying the wrong language rules is a common mistake. Each stemmer is language-specific. Using an English stemmer on non-English text, or vice-versa, will yield incorrect results.
fix Explicitly select the appropriate stemmer for the language of your text (e.g., `snowballstemmer.stemmer('german')`). Implement language detection if processing multilingual content.
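One way to make language selection explicit is a small lookup helper that validates against `snowballstemmer.algorithms()` and caches one stemmer per language. `get_stemmer` is a hypothetical helper, not part of the library:

```python
import snowballstemmer

_cache = {}

def get_stemmer(lang):
    """Return a cached stemmer for `lang`, failing fast on unknown names."""
    if lang not in _cache:
        if lang not in snowballstemmer.algorithms():
            raise ValueError(f"no Snowball algorithm for {lang!r}")
        _cache[lang] = snowballstemmer.stemmer(lang)
    return _cache[lang]

# Each language gets its own rule set.
print(get_stemmer('german').stemWord('aufeinander'))
print(get_stemmer('english').stemWord('running'))
```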
gotcha Stemming can lead to over-stemming (stripping too much, grouping unrelated words) or under-stemming (not stripping enough, failing to group related words) due to its rule-based nature.
fix Evaluate the stemming output on representative data and understand its limitations. For higher precision, consider hybrid approaches or lemmatization, especially for irregular forms.
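A practical evaluation technique is to group a representative vocabulary by stem and eyeball the conflation classes; words that land in the same group despite being unrelated indicate over-stemming, and related words in separate groups indicate under-stemming. A sketch (the sample vocabulary is illustrative):

```python
from collections import defaultdict

import snowballstemmer

stemmer = snowballstemmer.stemmer('english')

vocab = ['running', 'runs', 'run', 'runner', 'organize', 'organs']
groups = defaultdict(list)
for word in vocab:
    groups[stemmer.stemWord(word)].append(word)

# Each stem maps to the words it conflates.
for stem, members in sorted(groups.items()):
    print(stem, '->', members)
```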
gotcha A `Stemmer` object is not thread-safe if the same object is used concurrently by multiple threads. This can lead to unexpected behavior in concurrent applications.
fix For concurrent stemming in different threads, create a separate `Stemmer` object for each thread. Creating stemmer objects has some cost, but they are re-entrant.
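One common pattern is `threading.local()`, which lazily creates one `Stemmer` per thread so no object is ever shared. A sketch under that assumption:

```python
import threading

import snowballstemmer

_local = threading.local()

def stem(word):
    # Each thread builds its own Stemmer on first use.
    if not hasattr(_local, 'stemmer'):
        _local.stemmer = snowballstemmer.stemmer('english')
    return _local.stemmer.stemWord(word)

results = []

def worker(words):
    results.append([stem(w) for w in words])  # list.append is atomic under the GIL

threads = [threading.Thread(target=worker, args=(['running', 'jumps'],))
           for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```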
gotcha For performance-critical applications, the pure Python `snowballstemmer` can be slower than C-based implementations. A significant speedup can be achieved by installing `PyStemmer`.
fix Install `PyStemmer` (e.g., `pip install PyStemmer`). The `snowballstemmer` library will automatically detect and utilize `PyStemmer` for faster processing if it's available.
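PyStemmer installs a module named `Stemmer`, so you can report which backend is active with a plain import check; stemming results are the same either way. A sketch:

```python
import snowballstemmer

try:
    import Stemmer  # PyStemmer's C-backed module; picked up automatically
    backend = 'PyStemmer (C)'
except ImportError:
    backend = 'pure Python'

print('stemming backend:', backend)

# Identical behaviour regardless of backend:
stemmer = snowballstemmer.stemmer('english')
print(stemmer.stemWord('caresses'))  # 'caress'
```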
python  os / libc      status  wheel install  import  disk
3.9     alpine (musl)  wheel   -              0.05s   18.7M
3.9     alpine (musl)  -       -              0.05s   18.7M
3.9     slim (glibc)   wheel   1.8s           0.04s   19M
3.9     slim (glibc)   -       -              0.05s   19M
3.10    alpine (musl)  wheel   -              0.27s   19.3M
3.10    alpine (musl)  -       -              0.18s   19.3M
3.10    slim (glibc)   wheel   1.6s           0.30s   20M
3.10    slim (glibc)   -       -              0.36s   20M
3.11    alpine (musl)  wheel   -              0.43s   21.6M
3.11    alpine (musl)  -       -              0.43s   21.6M
3.11    slim (glibc)   wheel   1.7s           0.40s   22M
3.11    slim (glibc)   -       -              0.35s   22M
3.12    alpine (musl)  wheel   -              0.32s   13.4M
3.12    alpine (musl)  -       -              0.53s   13.4M
3.12    slim (glibc)   wheel   1.6s           0.36s   14M
3.12    slim (glibc)   -       -              0.41s   14M
3.13    alpine (musl)  wheel   -              0.26s   13.2M
3.13    alpine (musl)  -       -              0.27s   13.1M
3.13    slim (glibc)   wheel   1.6s           0.27s   14M
3.13    slim (glibc)   -       -              0.29s   14M

This example demonstrates how to initialize an English stemmer and use it to stem individual words and lists of words. It also shows how to retrieve the list of supported stemming algorithms.

import snowballstemmer

# List every algorithm name this installation supports.
algorithms = snowballstemmer.algorithms()
print(f"Available stemmers: {', '.join(algorithms)}")

# Create an English stemmer and stem words one at a time.
stemmer = snowballstemmer.stemmer('english')
words = ['running', 'runs', 'ran', 'runner', 'unnecessary']
stems = [stemmer.stemWord(word) for word in words]
print(f"Words: {words}")
print(f"Stems: {stems}")

# stemWords processes a whole list in one call. Note that .split() keeps
# punctuation attached ('run.'), so real pipelines should tokenize first.
sentence_words = "We are running in the fields and watching runners run.".lower().split()
sentence_stems = stemmer.stemWords(sentence_words)
print(f"Sentence words: {sentence_words}")
print(f"Sentence stems: {sentence_stems}")