Natural Language Toolkit (NLTK)
NLTK (Natural Language Toolkit) is a leading open-source Python library for Natural Language Processing (NLP). It provides easy-to-use interfaces to over 50 corpora and lexical resources, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning. Currently at version 3.9.4, NLTK ships a few minor releases per year, with larger updates for security fixes and Python compatibility as needed.
Warnings
- breaking: NLTK 3.9 replaced pickled models (e.g., the `punkt` tokenizer, chunkers, taggers) with pickle-free `_tab` packages (such as `punkt_tab`) to fix security vulnerability CVE-2024-39705. Code that still requests the old pickled packages may fail on newer NLTK versions, and older NLTK versions using pickled models remain insecure.
- gotcha: Many NLTK features (e.g., tokenizers, taggers, corpora) require downloading specific data packages first; without them, calls raise a `LookupError` ("Resource ... not found"). `nltk.download('all')` fetches every package and is resource-intensive and unsuitable for production; download only what you need.
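Rather than downloading everything, a script can check for each resource and fetch only what is missing; a minimal sketch (the `ensure` helper name is our own, not part of NLTK's API):

```python
import nltk

def ensure(resource_path, package):
    """Download `package` only if `resource_path` is not already installed.

    Hypothetical helper: `resource_path` is the path nltk.data.find()
    expects (e.g. 'tokenizers/punkt_tab'); `package` is the downloader id.
    """
    try:
        nltk.data.find(resource_path)
    except LookupError:  # raised when the resource is missing locally
        nltk.download(package, quiet=True)
```

For example, `ensure('tokenizers/punkt_tab', 'punkt_tab')` fetches just the pickle-free tokenizer data used by NLTK 3.9+.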
Install
-
pip install nltk
Imports
- nltk
import nltk
- word_tokenize
from nltk.tokenize import word_tokenize
- PorterStemmer
from nltk.stem import PorterStemmer
Quickstart
import nltk
from nltk.tokenize import word_tokenize
from nltk.tag import pos_tag
# Download necessary NLTK data (run once).
# NLTK 3.9+ uses pickle-free packages such as punkt_tab;
# nltk.data.find() raises LookupError when a resource is missing.
try:
    nltk.data.find('tokenizers/punkt_tab')
except LookupError:
    nltk.download('punkt_tab')
try:
    nltk.data.find('taggers/averaged_perceptron_tagger_eng')
except LookupError:
    nltk.download('averaged_perceptron_tagger_eng')
text = "NLTK is a powerful library for natural language processing."
# Tokenization
tokens = word_tokenize(text)
print(f"Tokens: {tokens}")
# Part-of-Speech Tagging
tagged_tokens = pos_tag(tokens)
print(f"POS Tagged: {tagged_tokens}")
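The `PorterStemmer` listed in the Imports section needs no downloaded data, so it works straight after `pip install nltk`; a minimal sketch:

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
# Porter stemming strips common suffixes to reduce words to a stem;
# unlike tokenizers and taggers, it requires no nltk.download() call.
words = ["running", "cats", "easily", "connection"]
print([stemmer.stem(w) for w in words])
```

Note that stems are not always dictionary words (e.g. "easily" stems to "easili"); that is expected behavior for the Porter algorithm.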