YAKE! Keyword Extraction
YAKE! (Yet Another Keyword Extractor) is a lightweight, unsupervised Python library for automatic keyword extraction. It identifies the most relevant keywords in a document using statistical text features, without requiring training data, external corpora, or dictionaries, and it supports multiple languages. The current version is 0.7.3; recent updates have focused on performance and added lemmatization capabilities.
Common errors
- ModuleNotFoundError: No module named 'yake'
  cause: The `yake` package is not installed, or the Python interpreter cannot find it in the current environment.
  fix: pip install yake
- LookupError: Resource punkt not found. Please use the NLTK Downloader to obtain the resource: >>> import nltk >>> nltk.download('punkt')
  cause: The `punkt` tokenizer data, a common NLTK resource, has not been downloaded, and YAKE! (or one of its underlying lemmatization dependencies) is attempting to use it.
  fix: Run `import nltk; nltk.download('punkt')` in a Python interpreter or script to download the required NLTK data.
- TypeError: tuple indices must be integers or slices, not str
  cause: The `extract_keywords` method returns a list of `(keyword, score)` tuples. This error occurs when a tuple is indexed like a dictionary (e.g., `keyword['name']`); tuples accept only integer positions such as `keyword[0]`. Indexing the list itself with a string key raises the analogous error for lists.
  fix: Iterate through the list of tuples and unpack them: `for kw, score in keywords: print(f"Keyword: {kw}, Score: {score}")`
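The tuple handling described above can be sketched without running YAKE! at all. The list below is hand-written sample data in the same `(keyword, score)` shape; the scores are made up for illustration, and `to_records` is a hypothetical helper, not part of the library. In YAKE!, a lower score marks a more relevant keyword.

```python
# Hand-written sample in the (keyword, score) shape returned by extract_keywords.
# The scores are illustrative assumptions, not real YAKE! output.
keywords = [
    ("google", 0.026),
    ("kaggle", 0.028),
    ("data science", 0.054),
]

# Unpack each tuple; positional access (pair[0], pair[1]) also works.
for kw, score in keywords:
    print(f"{kw}: {score:.3f}")


def to_records(pairs):
    """Convert (keyword, score) tuples into dicts for name-based access."""
    return [{"keyword": kw, "score": score} for kw, score in pairs]


records = to_records(keywords)

# Lowest score = most relevant keyword.
best = min(keywords, key=lambda pair: pair[1])
```

Converting to dictionaries is only worthwhile if downstream code expects name-based access; for simple loops, plain unpacking is enough.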
Warnings
- breaking Version 0.6.0 was released as a 'Refactored version of YAKE!'. Users upgrading from earlier versions may encounter breaking API changes, particularly in how `KeywordExtractor` is initialized or how its methods are called.
- gotcha If using YAKE!'s lemmatization features (enabled via `spacy` or `nltk` optional dependencies), NLTK data (e.g., 'punkt') might be required, leading to `LookupError` if not downloaded.
- deprecated An older `yake` package (e.g., v0.3.x) is present on PyPI and is officially deprecated and unmaintained. Installing this older version will lead to outdated functionality and no support.
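The warnings above can be checked before any extraction code runs. The following is a minimal sketch using only the standard library; `installed_version`, `is_pre_refactor`, and `punkt_available` are hypothetical helper names, and the version parsing assumes a plain `major.minor.patch` string.

```python
from importlib import metadata, util


def installed_version(package="yake"):
    """Return the installed version string, or None if the package is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def is_pre_refactor(version):
    """True if a version string such as '0.4.8' is older than 0.6.0.

    Assumes the first two dot-separated parts are plain integers.
    """
    major, minor = (int(part) for part in version.split(".")[:2])
    return (major, minor) < (0, 6)


def punkt_available():
    """True if NLTK is importable and its 'punkt' data can be located."""
    if util.find_spec("nltk") is None:
        return False
    import nltk

    try:
        nltk.data.find("tokenizers/punkt")
        return True
    except LookupError:
        return False


version = installed_version()
if version is None:
    print("yake is not installed")
elif is_pre_refactor(version):
    print(f"yake {version} predates the 0.6.0 refactor")
else:
    print(f"yake {version} is at or after the 0.6.0 refactor")
print("punkt available:", punkt_available())
```

If `punkt_available()` returns `False` and lemmatization is needed, running `nltk.download('punkt')` once should resolve the `LookupError` described above.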
Install
- pip install yake
Imports
- KeywordExtractor
  wrong: import yake.KeywordExtractor
  correct: from yake import KeywordExtractor
Quickstart
import yake
text = """Sources tell us that Google is acquiring Kaggle, a platform that hosts data science and machine learning competitions.
Details about the transaction remain somewhat vague, but given that Google is hosting its Cloud Next conference in San Francisco this week,
the official announcement could come as early as tomorrow. Reached by phone, Kaggle co-founder CEO Anthony Goldbloom declined
to deny that the acquisition is happening. Google itself declined 'to comment on rumors'. Kaggle, which has about half a million
data scientists on its platform, was founded by Goldbloom and Ben Hamner in 2010."""
# Default parameters
kw_extractor = yake.KeywordExtractor()
keywords = kw_extractor.extract_keywords(text)
print("Keywords (default settings):")
for kw, score in keywords:
    print(f"Keyphrase: {kw}, Score: {score}")
# Customizing parameters
# lan: language, n: max n-gram size, dedupLim: deduplication threshold,
# dedupFunc: deduplication function, windowsSize: window size, top: number of keywords
custom_kw_extractor = yake.KeywordExtractor(lan="en", n=3, dedupLim=0.9, dedupFunc='seqm', windowsSize=3, top=10, features=None)
keywords_custom = custom_kw_extractor.extract_keywords(text)
print("\nKeywords (custom settings):")
for kw, score in keywords_custom:
    print(f"Keyphrase: {kw}, Score: {score}")
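To make the deduplication threshold (`dedupLim`) more concrete, here is a rough sketch of threshold-based deduplication in general. It is not YAKE!'s implementation: `similarity` uses the standard library's `difflib.SequenceMatcher` as a stand-in, `dedup` is a hypothetical function, and the candidate scores are invented. YAKE!'s own similarity functions and internals may differ.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Similarity in [0, 1]; a stand-in, not YAKE!'s internal function."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def dedup(candidates, dedup_lim=0.9, top=10):
    """Walk candidates from best (lowest) score and keep each one
    unless it is more similar than dedup_lim to an already kept keyword."""
    kept = []
    for kw, score in sorted(candidates, key=lambda pair: pair[1]):
        if all(similarity(kw, other) <= dedup_lim for other, _ in kept):
            kept.append((kw, score))
        if len(kept) == top:
            break
    return kept


# Illustrative candidates with made-up scores (lower = more relevant).
candidates = [
    ("machine learning", 0.05),
    ("machine learning competitions", 0.07),
    ("Kaggle", 0.03),
]

strict = dedup(candidates, dedup_lim=0.5)   # drops the longer near-duplicate
lenient = dedup(candidates, dedup_lim=0.9)  # keeps all three candidates
```

Under this sketch, a lower threshold removes more overlapping phrases, while a value close to 1.0 removes only near-identical ones.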