Natural Language Toolkit (NLTK)
NLTK (Natural Language Toolkit) is a leading open-source Python library for Natural Language Processing (NLP). It provides easy-to-use interfaces to over 50 corpora and lexical resources, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning. Currently at version 3.9.4, NLTK generally follows a release cadence of a few minor versions per year, with more significant updates addressing security and Python compatibility as needed.
Common errors
- ModuleNotFoundError: No module named 'nltk'
cause NLTK is not installed in the Python environment that is running your code. This often happens when `pip` belongs to a different interpreter or virtual environment than the `python` you launch.
fix Install NLTK with the same interpreter you use to run your code: `python -m pip install nltk` (or `pip install nltk` / `pip3 install nltk` if they point at that interpreter).
- LookupError: Resource 'punkt' not found. Please use the NLTK Downloader to obtain the resource:
cause NLTK requires additional data packages (such as 'punkt' for tokenization, 'stopwords' for stop word lists, and 'wordnet' for lexical resources) that are not included with the library and must be downloaded separately.
fix Open a Python interpreter and run `import nltk; nltk.download('punkt')`, replacing 'punkt' with the resource named in the error message (e.g., 'stopwords', 'wordnet', 'averaged_perceptron_tagger'). On NLTK 3.9 and later the message names the newer variants instead, such as 'punkt_tab' or 'averaged_perceptron_tagger_eng'. `nltk.download('popular')` fetches the commonly used packages; `nltk.download('all')` fetches every available package.
- Error loading [resource name]: <urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed:
cause The NLTK data downloader is hitting an SSL certificate verification failure, often due to corporate network proxies, firewalls, or an outdated certificate store in the Python installation.
fix Bypass SSL verification for the NLTK download. Before calling `nltk.download()`, run `import ssl; ssl._create_default_https_context = ssl._create_unverified_context`, then call `nltk.download('popular')` (or the specific resource you need). Note that the attribute on the `ssl` module is `_create_unverified_context`; this setting disables certificate checks for the standard library's HTTPS connections in that process, so use it only on networks you trust.
- AttributeError: module 'nltk' has no attribute 'download'
cause This usually happens when one of your own Python files is named `nltk.py`, so Python imports your local file instead of the actual NLTK library. It can also indicate a very old or corrupted NLTK installation.
fix Rename your script if it is named `nltk.py` (or any other name that conflicts with an NLTK module). If that is not the case, reinstall or update NLTK with `pip install --upgrade nltk`.
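The SSL workaround for `CERTIFICATE_VERIFY_FAILED` is easier to follow as a multi-line snippet. The following is a minimal sketch using only the standard library; `ssl._create_unverified_context` is a private helper, so the `try`/`except` guards against interpreters where it is absent. The variable name `unverified_context_factory` is illustrative.

```python
import ssl

# ssl._create_unverified_context is a private stdlib helper; guard in case
# the running interpreter does not provide it.
try:
    unverified_context_factory = ssl._create_unverified_context
except AttributeError:
    # Nothing to patch on this interpreter; leave the default in place.
    pass
else:
    # Make the stdlib's default HTTPS context skip certificate verification
    # for the rest of this process.
    ssl._create_default_https_context = unverified_context_factory

# After this point, call nltk.download(...) as usual in the same process.
```

Because the change is process-wide, it is best placed in a one-off download script rather than in application code.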
Warnings
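- breaking NLTK 3.9 replaced pickled models (e.g., for `punkt`, chunkers, taggers) with pickle-free packages to fix security vulnerability CVE-2024-39705. Code that downloads the old names must switch to the new ones, for example `punkt_tab` instead of `punkt` and `averaged_perceptron_tagger_eng` instead of `averaged_perceptron_tagger`. Older NLTK versions that load pickled models may be insecure, and their data packages are not what newer NLTK versions look for.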
- gotcha Many NLTK features (e.g., tokenizers, taggers, corpora) require downloading specific datasets. If a dataset is missing, NLTK raises a `LookupError` with a "Resource ... not found" message. Running `nltk.download('all')` fetches every available package, which is slow and storage-heavy and generally unsuitable for production environments; download only the resources you use.
- breaking `nltk.downloader.DownloadError` is not available in current NLTK releases, so code that names it in an `except` clause fails with an `AttributeError` as soon as the clause is evaluated. To detect a missing resource, catch the built-in `LookupError` raised by `nltk.data.find()`. By default `nltk.download()` reports failures by returning `False` rather than raising, so check its return value.
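The download-on-demand pattern behind these warnings can be sketched as a small helper. This is an illustration, not part of NLTK: `ensure_resource` and its `find`/`download` parameters are hypothetical names, added so the logic can be exercised without network access. When they are omitted, the helper falls back to `nltk.data.find` and `nltk.download`.

```python
def ensure_resource(path, package, find=None, download=None):
    """Return True if the resource is available after this call.

    path     -- lookup path, e.g. "tokenizers/punkt_tab"
    package  -- downloader id, e.g. "punkt_tab"
    find, download -- hypothetical injection points for testing; they
                      default to nltk.data.find and nltk.download.
    """
    if find is None or download is None:
        import nltk  # imported lazily so the helper can be tested without NLTK
        find = find or nltk.data.find
        download = download or nltk.download
    try:
        find(path)  # raises LookupError when the resource is missing
        return True
    except LookupError:
        # nltk.download returns True on success and False on failure
        return bool(download(package, quiet=True))
```

For example, `ensure_resource("tokenizers/punkt_tab", "punkt_tab")` would download the tokenizer data only when it is not already installed, which avoids pulling the full `all` collection.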
Install
-
pip install nltk
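After installing, it can help to confirm that the package landed in the interpreter you actually run, and that no local file named `nltk.py` shadows it. The check below is a minimal stdlib-only sketch; `locate_module` is a hypothetical helper name.

```python
import importlib.util
import sys


def locate_module(name):
    """Return the file a top-level module would be imported from, or None."""
    spec = importlib.util.find_spec(name)
    return None if spec is None else spec.origin


# The interpreter that is running this script; pip must target the same one.
print("Interpreter:", sys.executable)
# None means NLTK is not installed here; a path inside your own project
# suggests a local nltk.py is shadowing the real package.
print("nltk resolves to:", locate_module("nltk"))
```

If the reported path points into your project directory rather than `site-packages`, renaming the local file is usually the fix.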
Imports
- nltk
import nltk
- word_tokenize
from nltk.tokenize import word_tokenize
- PorterStemmer
from nltk.stem import PorterStemmer
Quickstart
import nltk
from nltk.tokenize import word_tokenize
from nltk.tag import pos_tag

# Download necessary NLTK data (run once).
# nltk.data.find() raises LookupError when a resource is missing.
# Resource names below are for NLTK 3.9+; older releases use 'punkt'
# and 'averaged_perceptron_tagger' instead.
try:
    nltk.data.find('tokenizers/punkt_tab')
except LookupError:
    nltk.download('punkt_tab')
try:
    nltk.data.find('taggers/averaged_perceptron_tagger_eng')
except LookupError:
    nltk.download('averaged_perceptron_tagger_eng')

text = "NLTK is a powerful library for natural language processing."

# Tokenization
tokens = word_tokenize(text)
print(f"Tokens: {tokens}")

# Part-of-Speech Tagging
tagged_tokens = pos_tag(tokens)
print(f"POS Tagged: {tagged_tokens}")