PyStemmer
PyStemmer provides efficient access to the stemming algorithms of the Snowball project by wrapping the `libstemmer_c` library in a Python module. It is used mainly in information retrieval and search engines to reduce inflected words to a common stem, which is not necessarily a dictionary word (for example, `cycling` becomes `cycl`). The current version is 3.0.0. The project is active, but releases are irregular and usually follow updates to the underlying Snowball library or changes in Python compatibility.
Warnings
- gotcha A Stemmer object is not thread-safe: race conditions can occur if several threads use the same object concurrently. Create a separate Stemmer per thread or guard the shared object with a lock.
- gotcha Installation can fail when no pre-built wheel is available for your system and Python version. In that case pip builds PyStemmer from source, which requires a C compiler and the Python development header files.
- breaking Python 2 is no longer actively supported. PyStemmer 2.2.0.1 was the final version tested with Python 2.
- gotcha Input strings are assumed to be Unicode. `stemWords` can also accept UTF-8 encoded byte strings, but bytes in any other encoding, or text that was decoded incorrectly, can lead to unexpected stemming results. When in doubt, decode input to Unicode before stemming.
Install
pip install PyStemmer
Imports
- Stemmer
import Stemmer
Quickstart
import Stemmer
# Get a list of available algorithms
algorithms = Stemmer.algorithms()
# print(algorithms) # Uncomment to see the list
# Get an instance of the English stemmer
stemmer = Stemmer.Stemmer('english')
# Stem a single word
word = 'cycling'
stemmed_word = stemmer.stemWord(word)
print(f"'{word}' stemmed to: '{stemmed_word}'")
# Stem a list of words
words = ['connection', 'connections', 'connective', 'connected', 'connecting']
stemmed_words = stemmer.stemWords(words)
print(f"Words {words} stemmed to: {stemmed_words}")