Gensim

4.4.0 · active · verified Thu Apr 09

Gensim is a Python library for topic modeling, document indexing, and similarity retrieval with large corpora. It is currently at version 4.4.0 and maintains an active release cadence of updates and bug fixes, including recent support for NumPy 2.0. Designed for natural language processing (NLP) and information retrieval (IR) tasks, it processes corpora as streams, so datasets larger than RAM can be handled without loading them into memory.

Warnings

Install
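
Gensim installs from PyPI; the standard pip invocation is:

```shell
python -m pip install --upgrade gensim
```

It is also packaged on conda-forge (`conda install -c conda-forge gensim`) for conda environments.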

Imports

Quickstart

This quickstart demonstrates how to preprocess text, create a dictionary, train a Word2Vec model, and then save, load, and use the model to find similar words. It highlights best practices for corpus preparation and basic model interaction.

import logging
from gensim.models import Word2Vec
from gensim.corpora import Dictionary
from gensim.utils import simple_preprocess

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

# Sample corpus
corpus = [
    "Human machine interface for lab abc computer applications",
    "A survey of user opinion of computer system response time",
    "The EPS user interface management system",
    "System and human system engineering testing of EPS",
    "Relation of user perceived response time to error measurement",
    "The generation of random binary non-linear sequences of great length",
    "The interaction of user and computer in an easy way",
    "Doctors use computers for medical diagnosis",
    "A quick brown fox jumps over the lazy dog"
]

# Preprocess the corpus: tokenize, lowercase, and filter
tokenized_corpus = [simple_preprocess(doc) for doc in corpus]

# Create a dictionary from the tokenized corpus (used by bag-of-words
# models such as LDA/LSI; Word2Vec trains directly on the token lists)
dictionary = Dictionary(tokenized_corpus)

# Filter out words that appear in fewer than 2 documents or in more than 50% of documents
dictionary.filter_extremes(no_below=2, no_above=0.5)

# Train a Word2Vec model. Word2Vec expects an iterable of sentences,
# where each sentence is a list of word tokens.
model = Word2Vec(
    sentences=tokenized_corpus,  # Your list of tokenized sentences
    vector_size=100,             # Dimensionality of the word vectors
    window=5,                    # Maximum distance between the current and predicted word within a sentence
    min_count=1,                 # Ignores all words with total frequency lower than this
    workers=4,                   # Use 4 worker threads to train the model
    epochs=10                    # Number of iterations (epochs) over the corpus
)

# Save the model
model.save("word2vec_model.model")

# Load the model
loaded_model = Word2Vec.load("word2vec_model.model")

# Find the most similar words to 'computer'
if 'computer' in loaded_model.wv:
    similar_words = loaded_model.wv.most_similar('computer')
    print("\nWords similar to 'computer':")
    for word, score in similar_words:
        print(f"{word}: {score:.4f}")
else:
    print("\n'computer' not in vocabulary.")
