RAKE NLTK
RAKE-NLTK is a Python implementation of the Rapid Automatic Keyword Extraction (RAKE) algorithm built on the Natural Language Toolkit (NLTK). It extracts key phrases from text by splitting the text at stopwords and punctuation, then scoring the remaining candidate phrases by word frequency and co-occurrence. The current version, 1.0.6 (released September 2021), provides a straightforward interface for keyword extraction with configurable tokenizers, stopwords, and ranking metrics. Releases are infrequent; the last major update was in 2021.
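To make the frequency and co-occurrence scoring concrete, here is a minimal pure-Python sketch of the degree-to-frequency idea behind RAKE. It is not the library's code: the tiny `STOPWORDS` set and the helper names `candidate_phrases` and `rank_phrases` are assumptions made for illustration, and the tokenization is deliberately simpler than what NLTK provides.

```python
import re
from collections import defaultdict

# Tiny illustrative stopword list (an assumption; RAKE-NLTK uses NLTK's corpus).
STOPWORDS = {"a", "and", "are", "for", "in", "is", "of", "the", "to"}


def candidate_phrases(text):
    """Split text into runs of consecutive non-stopwords."""
    phrases = []
    # Punctuation always ends a phrase, so cut the text into fragments first.
    for fragment in re.split(r"[^\w\s]+", text.lower()):
        current = []
        for word in fragment.split():
            if word in STOPWORDS:
                if current:
                    phrases.append(tuple(current))
                current = []
            else:
                current.append(word)
        if current:
            phrases.append(tuple(current))
    return phrases


def rank_phrases(text):
    """Score each distinct phrase as the sum of degree/frequency over its words."""
    phrases = candidate_phrases(text)
    frequency = defaultdict(int)  # how often a word occurs
    degree = defaultdict(int)     # co-occurrences within phrases, counting the word itself
    for phrase in phrases:
        for word in phrase:
            frequency[word] += 1
            degree[word] += len(phrase)
    scored = {
        phrase: sum(degree[w] / frequency[w] for w in phrase)
        for phrase in set(phrases)
    }
    # Highest score first, as (score, phrase) pairs.
    return sorted(((s, " ".join(p)) for p, s in scored.items()), reverse=True)


sample = "Keyword extraction is fun. Rapid automatic keyword extraction of phrases."
for score, phrase in rank_phrases(sample):
    print(f"{score:5.1f}  {phrase}")
```

Longer phrases made of words that mostly appear inside long phrases score highest, which is why "rapid automatic keyword extraction" outranks "keyword extraction" in the sample.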
Common errors
- LookupError: <resource> not found. Please use the NLTK Downloader to obtain the resource:
  cause: The required NLTK resource (e.g., 'stopwords' or 'punkt') has not been downloaded.
  fix: Run `import nltk; nltk.download('stopwords'); nltk.download('punkt')` in your Python environment. If the message names a different resource (recent NLTK releases may ask for 'punkt_tab'), download that one as well.
- error: package directory 'rake_nltk' does not exist
  cause: Installing `rake-nltk` by cloning the repository and running `python setup.py install` can fail because of build environment issues or older `pip` versions that do not handle dependencies correctly during setup.
  fix: Use `pip install rake-nltk` instead of installing from source. If problems persist, make sure `nltk` itself is installed.
- r.get_ranked_phrases() returns an empty list or unexpected results
  cause: The input text is too short, lacks significant keywords, or consists mostly of stop words. Missing NLTK data (stopwords or the punkt tokenizer) can also be involved.
  fix: Verify that `nltk.download('stopwords')` and `nltk.download('punkt')` have been executed, then check that the input text contains enough content and relevant non-stop words.
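The download fix can be wrapped in a small guard that only fetches what is missing. The sketch below is one possible approach, not part of `rake-nltk`: the helper name `ensure_nltk_data` and its `find` and `download` parameters are hypothetical, added so the logic can be exercised without network access. By default it relies on `nltk.data.find` and `nltk.download`.

```python
# Resource name -> path inside the NLTK data directory.
# Newer NLTK releases may also want "punkt_tab": "tokenizers/punkt_tab";
# add it here if the LookupError names it.
REQUIRED = {
    "stopwords": "corpora/stopwords",
    "punkt": "tokenizers/punkt",
}


def ensure_nltk_data(required=REQUIRED, find=None, download=None):
    """Download each missing NLTK resource and return the names that were fetched."""
    if find is None or download is None:
        # Imported lazily so the helper can be tested with stand-in callables.
        import nltk

        find = find or nltk.data.find
        download = download or nltk.download

    fetched = []
    for name, path in required.items():
        try:
            find(path)  # raises LookupError when the resource is absent
        except LookupError:
            download(name)
            fetched.append(name)
    return fetched
```

Calling `ensure_nltk_data()` once at startup, before constructing `Rake()`, avoids repeating the download calls on every run.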
Warnings
- breaking NLTK data (stopwords and punkt tokenizer) are critical dependencies for `rake-nltk` and must be downloaded separately. Without these, the library will fail with a `LookupError`.
- gotcha Installing `rake-nltk` directly from a cloned GitHub repository using `python setup.py install` can sometimes lead to an `error: package directory 'rake_nltk' does not exist`, especially in older `pip` versions or specific build environments. This is often related to how NLTK dependencies or post-install hooks are handled during the build process.
Install
- pip install rake-nltk
Imports
- Rake
from rake_nltk import Rake
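Besides `Rake`, the package also exports a `Metric` enum for choosing the ranking metric. The sketch below shows one way to combine a few constructor options; the wrapper function `build_rake` is a hypothetical name, and the particular values (phrases of two to four words, word-degree ranking, no repeated phrases) are arbitrary choices for illustration rather than recommended settings.

```python
def build_rake():
    """Return a Rake instance tuned for short, de-duplicated phrases."""
    # Imported here so a missing installation surfaces when the function is called.
    from rake_nltk import Metric, Rake

    return Rake(
        language="english",                 # which NLTK stopword list to load
        ranking_metric=Metric.WORD_DEGREE,  # default is DEGREE_TO_FREQUENCY_RATIO
        min_length=2,                       # skip single-word phrases
        max_length=4,                       # skip phrases longer than four words
        include_repeated_phrases=False,     # keep each phrase only once
    )
```

The returned object is used exactly like a default `Rake()`: call `extract_keywords_from_text(text)` and then `get_ranked_phrases()`. The NLTK stopwords and punkt data still need to be downloaded first.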
Quickstart
import nltk
from rake_nltk import Rake

# One-time download of the NLTK data that Rake relies on.
nltk.download('stopwords')
nltk.download('punkt')

text = """Compatibility of systems of diophantine equations, strict inequations, and nonstrict inequations are considered. Upper bounds for components of a minimal set of solutions and algorithms of construction of minimal generating sets of solutions for all types of systems are given. These criteria and the corresponding algorithms for constructing a minimal supporting set of solutions can be used in solving all the considered types of systems and systems of mixed types."""
# With no arguments, Rake uses NLTK's English stopwords and string.punctuation.
r = Rake()
r.extract_keywords_from_text(text)

# Phrases come back ordered from highest to lowest score.
ranked_phrases = r.get_ranked_phrases()
ranked_phrases_with_scores = r.get_ranked_phrases_with_scores()

print("Top 5 ranked phrases:")
for phrase in ranked_phrases[:5]:
    print(f"- {phrase}")

print("\nTop 5 ranked phrases with scores:")
for score, phrase in ranked_phrases_with_scores[:5]:
    print(f"- {phrase} (Score: {score:.2f})")