Gensim
Gensim is a Python library for topic modeling, document indexing, and similarity retrieval with large corpora. It is currently at version 4.4.0 and maintains an active release cadence with regular updates and bug fixes, including recent support for NumPy 2.0. Designed for natural language processing (NLP) and information retrieval (IR) tasks, it supports memory-independent processing for datasets larger than RAM.
Warnings
- breaking Gensim 4.0 introduced significant breaking API changes from 3.x. Key attributes and methods were renamed or moved. For example, `model.wv.vocab` was replaced by `model.wv.key_to_index` (word-to-index mapping) and `model.wv.index_to_key` (index-to-word list), and vector-related methods such as `most_similar()` now live only on `model.wv` (a KeyedVectors instance), not on the model itself. The `size` constructor parameter was renamed to `vector_size`, and `iter` to `epochs`.
- breaking Gensim 4.0.0 dropped support for Python 2.7. Users who still require Python 2.7 must pin Gensim 3.8.3 or an earlier version.
- gotcha Although Gensim is designed for memory-independent processing, training models such as LDA or Word2Vec on very large corpora can still consume a lot of RAM, particularly if the corpus is loaded into memory all at once instead of being streamed.
- gotcha Ensure your NumPy version is compatible with your Gensim installation. While Gensim 4.4.0 officially added support for NumPy 2.0, older Gensim 4.x releases (e.g., 4.0.1) had specific compatibility issues with NumPy binary packages on Windows.
Install
-
pip install gensim
Imports
- Word2Vec
from gensim.models import Word2Vec
- Doc2Vec
from gensim.models import Doc2Vec
- LdaModel
from gensim.models import LdaModel
- Dictionary
from gensim.corpora import Dictionary
- simple_preprocess
from gensim.utils import simple_preprocess
- model.wv.index_to_key
model.wv.index_to_key
- model.wv.most_similar()
model.wv.most_similar(word)
Quickstart
import logging
from gensim.models import Word2Vec
from gensim.corpora import Dictionary
from gensim.utils import simple_preprocess
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
# Sample corpus
corpus = [
    "Human machine interface for lab abc computer applications",
    "A survey of user opinion of computer system response time",
    "The EPS user interface management system",
    "System and human system engineering testing of EPS",
    "Relation of user perceived response time to error measurement",
    "The generation of random binary non-linear sequences of great length",
    "The interaction of user and computer in an easy way",
    "Doctors use computers for medical diagnosis",
    "A quick brown fox jumps over the lazy dog"
]
# Preprocess the corpus: tokenize, lowercase, and filter
tokenized_corpus = [simple_preprocess(doc) for doc in corpus]
# Build a Dictionary (used by bag-of-words models such as LDA;
# Word2Vec itself trains directly on the tokenized sentences)
dictionary = Dictionary(tokenized_corpus)
# Filter out words appearing in fewer than 2 documents or in more than 50% of them
dictionary.filter_extremes(no_below=2, no_above=0.5)
# Word2Vec expects an iterable of sentences, where each sentence is a list of words.
# Train a Word2Vec model
model = Word2Vec(
    sentences=tokenized_corpus,  # iterable of tokenized sentences
    vector_size=100,             # dimensionality of the word vectors
    window=5,                    # max distance between current and predicted word
    min_count=1,                 # ignore words with total frequency below this
    workers=4,                   # worker threads used during training
    epochs=10                    # number of passes (epochs) over the corpus
)
# Save the model
model.save("word2vec_model.model")
# Load the model
loaded_model = Word2Vec.load("word2vec_model.model")
# Find the most similar words to 'computer'
if 'computer' in loaded_model.wv:
    similar_words = loaded_model.wv.most_similar('computer')
    print("\nWords similar to 'computer':")
    for word, score in similar_words:
        print(f"{word}: {score:.4f}")
else:
    print("\n'computer' not in vocabulary.")