{"id":1835,"library":"gensim","title":"Gensim","description":"Gensim is a Python library for topic modeling, document indexing, and similarity retrieval over large corpora. The current version is 4.4.0; the project is actively maintained, and recent releases added support for NumPy 2.0. Aimed at natural language processing (NLP) and information retrieval (IR) tasks, it streams its input, so corpora larger than RAM can be processed.","status":"active","version":"4.4.0","language":"en","source_language":"en","source_url":"https://github.com/RaRe-Technologies/gensim","tags":["NLP","topic modeling","word embeddings","vector space models","document similarity","machine learning"],"install":[{"cmd":"pip install gensim","lang":"bash","label":"Install latest stable version"}],"dependencies":[{"reason":"Essential for numerical operations and core to Gensim's performance.","package":"numpy","optional":false},{"reason":"Used for scientific computing tasks within Gensim's algorithms.","package":"scipy","optional":false},{"reason":"Enables efficient streaming of very large files, including remote storage and compressed files.","package":"smart_open","optional":false},{"reason":"Optional but highly recommended: an optimized BLAS backend substantially speeds up NumPy's linear algebra, and with it Gensim's numerical computations.","package":"BLAS library (e.g., OpenBLAS, MKL, ATLAS)","optional":true}],"imports":[{"symbol":"Word2Vec","correct":"from gensim.models import Word2Vec"},{"symbol":"Doc2Vec","correct":"from gensim.models import Doc2Vec"},{"symbol":"LdaModel","correct":"from gensim.models import LdaModel"},{"symbol":"Dictionary","correct":"from gensim.corpora import Dictionary"},{"symbol":"simple_preprocess","correct":"from gensim.utils import simple_preprocess"},{"note":"In Gensim 4.0+, the `vocab` attribute was removed from `KeyedVectors`. 
Use `model.wv.index_to_key` for a list of words or `model.wv.key_to_index` for a word-to-index mapping.","wrong":"model.wv.vocab.keys()","symbol":"model.wv.index_to_key","correct":"model.wv.index_to_key"},{"note":"In Gensim 4.0+, vector-related methods (such as `most_similar`, `wmdistance`, `doesnt_match`, `similarity`) are available only on the `KeyedVectors` object (`.wv`); the deprecated shortcuts on the top-level model object were removed.","wrong":"model.most_similar(word)","symbol":"model.wv.most_similar()","correct":"model.wv.most_similar(word)"}],"quickstart":{"code":"import logging\nfrom gensim.models import Word2Vec\nfrom gensim.corpora import Dictionary\nfrom gensim.utils import simple_preprocess\n\nlogging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)\n\n# Sample corpus\ncorpus = [\n    \"Human machine interface for lab abc computer applications\",\n    \"A survey of user opinion of computer system response time\",\n    \"The EPS user interface management system\",\n    \"System and human system engineering testing of EPS\",\n    \"Relation of user perceived response time to error measurement\",\n    \"The generation of random binary non-linear sequences of great length\",\n    \"The interaction of user and computer in an easy way\",\n    \"Doctors use computers for medical diagnosis\",\n    \"A quick brown fox jumps over the lazy dog\"\n]\n\n# Preprocess the corpus: lowercase, tokenize, and drop very short or very long tokens\ntokenized_corpus = [simple_preprocess(doc) for doc in corpus]\n\n# Create a dictionary from the tokenized corpus\n# (used by bag-of-words models such as LdaModel; Word2Vec does not use it)\ndictionary = Dictionary(tokenized_corpus)\n\n# Filter out words that appear in fewer than 2 documents or in more than 50% of documents\ndictionary.filter_extremes(no_below=2, no_above=0.5)\n\n# Word2Vec expects an iterable of sentences, where each sentence is a list of words,\n# so tokenized_corpus can be passed to it directly.\n\n# Train a Word2Vec model\nmodel = Word2Vec(\n    sentences=tokenized_corpus,  # Iterable of tokenized sentences\n  
  vector_size=100,             # Dimensionality of the word vectors\n    window=5,                    # Maximum distance between the current and predicted word within a sentence\n    min_count=1,                 # Ignore words with total frequency lower than this\n    workers=4,                   # Use 4 worker threads to train the model\n    epochs=10                    # Number of iterations (epochs) over the corpus\n)\n\n# Save the model\nmodel.save(\"word2vec_model.model\")\n\n# Load the model\nloaded_model = Word2Vec.load(\"word2vec_model.model\")\n\n# Find the most similar words to 'computer'\nif 'computer' in loaded_model.wv:\n    similar_words = loaded_model.wv.most_similar('computer')\n    print(\"\\nWords similar to 'computer':\")\n    for word, score in similar_words:\n        print(f\"{word}: {score:.4f}\")\nelse:\n    print(\"\\n'computer' not in vocabulary.\")","lang":"python","description":"This quickstart demonstrates how to preprocess text, create a dictionary, train a Word2Vec model, and then save, load, and use the model to find similar words. The dictionary step is illustrative only: Word2Vec builds its own vocabulary from the tokenized sentences."},"warnings":[{"fix":"Consult the official Gensim 3.x to 4.x migration guide on GitHub. Update attribute access (e.g., `model.wv.index_to_key` instead of `model.wv.vocab.keys()`) and method calls (e.g., `model.wv.most_similar()` instead of `model.most_similar()`). Adjust parameter names for model initialization (e.g., `vector_size` instead of `size`, `epochs` instead of `iter`).","message":"Gensim 4.0 introduced significant breaking API changes from 3.x. Key attributes and methods were renamed, moved, or removed. For example, `model.wv.vocab` was replaced by `model.wv.key_to_index` or `model.wv.index_to_key`, and vector-related methods like `most_similar()` are now available only on `model.wv` (KeyedVectors), not on the model object. 
The `size` parameter was renamed to `vector_size`, and `iter` to `epochs`.","severity":"breaking","affected_versions":"4.0.0 and later"},{"fix":"Upgrade to Python 3.9+ (Gensim 4.4.0 requires >=3.9). If Python 2.7 is strictly necessary, pin Gensim to version 3.8.3: `pip install gensim==3.8.3`.","message":"Gensim 4.0.0 dropped support for Python 2.7. Users who require Python 2.7 must use Gensim 3.8.3 or an earlier version.","severity":"breaking","affected_versions":"4.0.0 and later"},{"fix":"Employ memory-efficient practices: filter stopwords, remove rare words, and use Gensim's `MmCorpus` or custom iterators that stream data from disk. Reduce dictionary size with `Dictionary.filter_extremes`, using its `no_below` and `no_above` thresholds. To speed up LDA training on multi-core machines, consider `LdaMulticore`.","message":"Although Gensim is designed for memory-independent processing, training models like LDA or Word2Vec on extremely large corpora can still consume a lot of memory, especially if the corpus is not preprocessed efficiently or is loaded entirely into RAM.","severity":"gotcha","affected_versions":"All versions"},{"fix":"Install the latest Gensim version (`pip install --upgrade gensim`), which includes the most recent compatibility fixes. If you still encounter issues, try pinning NumPy to a version known to be compatible, or check that your BLAS libraries are correctly configured if you build NumPy from source.","message":"Ensure your NumPy version is compatible with your Gensim installation. While Gensim 4.4.0 officially added support for NumPy 2.0, older Gensim 4.x releases (e.g., 4.0.1) had specific compatibility issues with NumPy binary packages on Windows.","severity":"gotcha","affected_versions":"Gensim 4.0.x to 4.3.x, especially with newer NumPy versions."}],"env_vars":null,"last_verified":"2026-04-09T00:00:00.000Z","next_check":"2026-07-08T00:00:00.000Z"}