Natural Language Toolkit (NLTK)


NLTK (Natural Language Toolkit) is a leading open-source Python library for Natural Language Processing (NLP). It provides easy-to-use interfaces to over 50 corpora and lexical resources, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning. Currently at version 3.9.4, NLTK generally follows a release cadence of a few minor versions per year, with more significant updates addressing security and Python compatibility as needed.

pip install nltk
error ModuleNotFoundError: No module named 'nltk'
cause The NLTK library is not installed in the Python environment you are currently using, or there's an issue with your Python PATH.
fix
Install NLTK with the pip that belongs to the interpreter you actually run: python -m pip install nltk (or python3 -m pip install nltk). A bare pip install nltk works too, but on systems with multiple Python installations it may target a different environment than the one raising the error.
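When the install "succeeds" but the import still fails, the pip you ran usually belongs to a different interpreter than the one executing your script. A quick stdlib-only check (no NLTK required) makes the mismatch visible:

```python
import importlib.util
import sys

# The interpreter actually running this script; compare it with
# `pip --version` to confirm both point at the same environment.
print("interpreter:", sys.executable)

# True only if `import nltk` would succeed from this interpreter.
nltk_available = importlib.util.find_spec("nltk") is not None
print("nltk importable:", nltk_available)
```

If the two paths disagree, reinstall with `python -m pip install nltk` using the interpreter printed above.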
error LookupError: Resource 'punkt' not found. Please use the NLTK Downloader to obtain the resource:
cause NLTK requires additional data packages (like 'punkt' for tokenization, 'stopwords' for stop word lists, 'wordnet' for lexical resources, etc.) that are not included in the initial library installation and must be downloaded separately.
fix
Open a Python interpreter and run import nltk; nltk.download('punkt') to download the 'punkt' tokenizer data. For other resources, replace 'punkt' with the name of the missing resource (e.g., 'stopwords', 'wordnet', 'averaged_perceptron_tagger'); run nltk.download('popular') for the most commonly used collections, or nltk.download('all') for everything. Note that on NLTK 3.9+, word_tokenize() needs the pickle-free 'punkt_tab' package instead; the LookupError message names the exact package to download.
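To avoid re-downloading on every run, a small guard helper works well (a sketch; the name ensure_nltk_resource is ours). Note that nltk.data.find() raises the built-in LookupError when a package is missing:

```python
def ensure_nltk_resource(find_path: str, package: str) -> None:
    """Download an NLTK data package only if it is not already on disk.

    find_path -- the lookup path, e.g. 'tokenizers/punkt'
    package   -- the downloadable package name, e.g. 'punkt'
    """
    import nltk  # imported here so the helper is copy-paste self-contained

    try:
        nltk.data.find(find_path)       # raises LookupError when missing
    except LookupError:
        nltk.download(package, quiet=True)

# Usage: ensure_nltk_resource('tokenizers/punkt', 'punkt')
```

Calling it repeatedly is cheap once the data exists, so it is safe to run at application startup.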
error Error loading [resource name]: <urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed:
cause The NLTK data downloader is encountering an SSL certificate verification issue, often due to corporate network proxies, firewalls, or an outdated Python installation's certificate store.
fix
Bypass SSL verification for the NLTK downloader (use this only when a trusted corporate proxy is the cause; on macOS, running "Install Certificates.command" from the Python application folder is the cleaner fix). In your script or interpreter, run the following before nltk.download():

import ssl
try:
    _create_unverified_https_context = ssl._create_unverified_https_context
except AttributeError:
    pass  # very old Python versions lack this private helper
else:
    ssl._create_default_https_context = _create_unverified_https_context

import nltk
nltk.download('popular')  # or the specific resource you need
error AttributeError: module 'nltk' has no attribute 'download'
cause This usually happens if you've inadvertently named one of your Python files 'nltk.py', which causes Python to import your local file instead of the actual NLTK library, or if you're using a very old or corrupted NLTK installation.
fix
Rename your Python script if it's named nltk.py (or any other name that conflicts with an NLTK module). If that's not the case, ensure NLTK is properly installed and updated by running pip install --upgrade nltk.
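To confirm whether a local file is shadowing the real library, inspect where Python resolves the nltk module from (stdlib only, so it works even while the import misbehaves):

```python
import importlib.util

spec = importlib.util.find_spec("nltk")
if spec is None:
    print("nltk is not installed in this environment")
else:
    # A healthy install resolves to .../site-packages/nltk/__init__.py.
    # A path inside your own project means a local nltk.py (or nltk/
    # folder) is shadowing the real library -- rename it.
    print("nltk resolves to:", spec.origin)
```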
breaking NLTK 3.9 replaced pickled models with pickle-free `_tab` data packages (e.g., `punkt` → `punkt_tab`, `averaged_perceptron_tagger` → `averaged_perceptron_tagger_eng`) to fix security vulnerability CVE-2024-39705, in which loading a maliciously crafted pickled model could execute arbitrary code. Code that downloads or loads the old pickled package names breaks on NLTK 3.9+.
fix Upgrade NLTK to version 3.9 or higher. Ensure your application is updated to use the new `_tab` packages or re-download corpora with `nltk.download()` after upgrading. Specifically, NLTK 3.9.3 fixed CVE-2025-14009 related to secure ZIP extraction.
gotcha Many NLTK functionalities (e.g., tokenizers, taggers, corpora) require downloading specific datasets. Using them without the data raises a `LookupError` ("Resource ... not found"). Running `nltk.download('all')` downloads every available dataset, which is slow, large on disk, and unsuitable for production environments.
fix Before using a specific NLTK module that relies on external data, ensure the necessary data is downloaded. For production, explicitly download only the required packages using `nltk.download('package_name')` once during setup, or use `nltk.data.path.append('/path/to/nltk_data')` to point to pre-downloaded data. For example, `nltk.download('punkt')` for the Punkt tokenizer.
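For containerized deployments, a common pattern is baking the data into the image at build time and pointing NLTK at it via the `NLTK_DATA` environment variable (read when `nltk` is imported), rather than downloading at runtime. The path below is a placeholder:

```python
import os

# Hypothetical directory where the image build step already ran
# nltk.download('punkt', download_dir='/opt/nltk_data') etc.
os.environ["NLTK_DATA"] = "/opt/nltk_data"

# Equivalent after import: nltk.data.path.append("/opt/nltk_data")
print(os.environ["NLTK_DATA"])
```

Set the variable before the first `import nltk` so the search path is correct from the start.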
gotcha The `nltk.downloader` module does not define a `DownloadError` exception, even though older tutorials catch one. Code containing `except nltk.downloader.DownloadError:` fails with `AttributeError: module 'nltk.downloader' has no attribute 'DownloadError'` as soon as the except clause is evaluated. A missing data package actually surfaces as the built-in `LookupError` raised by `nltk.data.find()`.
fix Catch the built-in `LookupError` around `nltk.data.find()` when checking whether a data package is present, instead of referencing `nltk.downloader.DownloadError`.
| python | os / libc     | status | wheel | install | import | disk  |
|--------|---------------|--------|-------|---------|--------|-------|
| 3.9    | alpine (musl) | -      | -     | 0.80s   |        | 34.5M |
| 3.9    | slim (glibc)  | -      | -     | 0.69s   |        | 36M   |
| 3.10   | alpine (musl) | -      | -     | 0.99s   |        | 35.4M |
| 3.10   | slim (glibc)  | -      | -     | 0.66s   |        | 36M   |
| 3.11   | alpine (musl) | -      | -     | 1.63s   |        | 40.5M |
| 3.11   | slim (glibc)  | -      | -     | 1.35s   |        | 41M   |
| 3.12   | alpine (musl) | -      | -     | 1.20s   |        | 31.5M |
| 3.12   | slim (glibc)  | -      | -     | 1.29s   |        | 32M   |
| 3.13   | alpine (musl) | -      | -     | 0.83s   |        | 31.0M |
| 3.13   | slim (glibc)  | -      | -     | 1.32s   |        | 32M   |

This quickstart demonstrates basic tokenization and part-of-speech (POS) tagging with NLTK. It first checks for the 'punkt_tab' tokenizer data and the 'averaged_perceptron_tagger_eng' tagger model (the pickle-free packages used by NLTK 3.9+) and downloads them if missing, so the example runs out of the box.

import nltk
from nltk.tokenize import word_tokenize
from nltk.tag import pos_tag

# Download the required NLTK data on first run.
# nltk.data.find() raises the built-in LookupError when a package is missing.
try:
    nltk.data.find('tokenizers/punkt_tab')
except LookupError:
    nltk.download('punkt_tab')
try:
    nltk.data.find('taggers/averaged_perceptron_tagger_eng')
except LookupError:
    nltk.download('averaged_perceptron_tagger_eng')

text = "NLTK is a powerful library for natural language processing."

# Tokenization
tokens = word_tokenize(text)
print(f"Tokens: {tokens}")

# Part-of-Speech Tagging
tagged_tokens = pos_tag(tokens)
print(f"POS Tagged: {tagged_tokens}")