Gensim
Gensim is a Python library for unsupervised topic modeling and document similarity. Its strengths are Word2Vec and Doc2Vec training and LDA topic modeling, all designed to handle large corpora by streaming documents one at a time rather than loading everything into RAM.
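The streaming idea can be sketched without Gensim itself. Gensim's models accept any iterable that yields one tokenized document at a time, and they iterate over it more than once (a vocabulary pass plus one pass per epoch), so a one-shot generator is not enough. The class below is a minimal sketch; `StreamingCorpus` and the one-document-per-line file layout are assumptions made for illustration, not part of Gensim.

```python
import os
import tempfile


class StreamingCorpus:
    """Hypothetical restartable iterable: yields one tokenized line at a time."""

    def __init__(self, path):
        self.path = path

    def __iter__(self):
        # Reopen the file on every pass so the corpus can be iterated repeatedly.
        with open(self.path, encoding="utf-8") as handle:
            for line in handle:
                tokens = line.lower().split()
                if tokens:  # skip blank lines
                    yield tokens


# Write a tiny demo file; a real corpus would already live on disk.
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="utf-8"
) as tmp:
    tmp.write("Word embeddings capture semantic relationships\n")
    tmp.write("\n")
    tmp.write("Language models learn to predict text\n")
    demo_path = tmp.name

corpus = StreamingCorpus(demo_path)
first_pass = list(corpus)
second_pass = list(corpus)  # works because __iter__ reopens the file
os.remove(demo_path)

print(len(first_pass), first_pass[0][:2])
```

Only one line is held in memory at a time. An object like this can be passed wherever a list of token lists is expected, for example as the `sentences` argument of `Word2Vec`.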
Installation
    pip install gensim

Word2Vec: Train on Your Own Corpus

    from gensim.models import Word2Vec
    from nltk.tokenize import word_tokenize, sent_tokenize
    import nltk

    nltk.download('punkt_tab')

    corpus_text = """
    Natural language processing is a subfield of artificial intelligence.
    Language models learn to predict text by training on massive corpora.
    Word embeddings capture semantic relationships between words in continuous space.
    BERT uses bidirectional transformers pretrained on masked language modeling.
    GPT models use autoregressive attention to generate coherent text sequences.
    """

    # One list of lowercase tokens per sentence
    sentences = [word_tokenize(s.lower()) for s in sent_tokenize(corpus_text)]

    model = Word2Vec(
        sentences,
        vector_size=128,  # embedding dimensions
        window=5,         # context window
        min_count=1,      # minimum word frequency
        sg=1,             # Skip-Gram (1) vs CBOW (0)
        epochs=100,
        seed=42,
    )

    print("Vocabulary size:", len(model.wv))

    # Semantic operations
    print(model.wv.most_similar("language", topn=3))
    print(f"Similarity (language, text): {model.wv.similarity('language', 'text'):.4f}")

    # Save and load
    model.save("word2vec.model")
    loaded = Word2Vec.load("word2vec.model")

Using Pre-Trained Word2Vec Embeddings
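Similarity queries on word vectors, including `most_similar` and the classic analogy query, rank words by cosine similarity. The sketch below reproduces that arithmetic on made-up 3-dimensional vectors so the mechanics are visible without downloading a model. The vectors, the `toy_vectors` dictionary, and the `unit`, `cosine`, and `analogy` helpers are all hypothetical; they illustrate the idea and are not Gensim's implementation.

```python
import math

# Made-up 3-dimensional vectors, chosen by hand for illustration only.
toy_vectors = {
    "king": [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man": [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "apple": [0.2, 0.9, 0.3],
}


def unit(vec):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cosine(a, b):
    """Cosine similarity: dot product of the two unit vectors."""
    return sum(x * y for x, y in zip(unit(a), unit(b)))


def analogy(positive, negative, vectors):
    """Rank the remaining words by cosine similarity to sum(positive) - sum(negative)."""
    dims = len(next(iter(vectors.values())))
    query = [0.0] * dims
    for word in positive:
        query = [q + x for q, x in zip(query, unit(vectors[word]))]
    for word in negative:
        query = [q - x for q, x in zip(query, unit(vectors[word]))]
    # The input words themselves are excluded from the candidates.
    candidates = [w for w in vectors if w not in positive and w not in negative]
    return sorted(candidates, key=lambda w: cosine(query, vectors[w]), reverse=True)


ranking = analogy(positive=["king", "woman"], negative=["man"], vectors=toy_vectors)
print(ranking[0])
```

With real embeddings the same query runs over hundreds of dimensions and millions of words, which is why a pre-trained model is the practical route.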
    import gensim.downloader as api

    # Available models
    # print(list(api.info()['models'].keys()))

    # Load Google News Word2Vec (300-dim, trained on ~100B words)
    model = api.load("word2vec-google-news-300")

    # Classic analogy
    result = model.most_similar(positive=["king", "woman"], negative=["man"])
    print("King - Man + Woman =", result[0])  # ('queen', 0.71...)

    # Find odd one out
    print(model.doesnt_match(["python", "java", "javascript", "broccoli"]))
    # broccoli

Doc2Vec: Document-Level Embeddings
Doc2Vec extends Word2Vec to produce a single vector for an entire document:
    from gensim.models.doc2vec import Doc2Vec, TaggedDocument
    from nltk.tokenize import word_tokenize
    import nltk

    nltk.download('punkt_tab')

    documents = [
        "Transformers use self-attention mechanisms to process sequences in parallel.",
        "BERT is pretrained using masked language modeling and next sentence prediction.",
        "GPT generates text autoregressively by predicting the next token.",
        "LLaMA and Mistral are open-source transformer models from Meta and Mistral AI.",
        "Python is a popular language for scientific computing and data analysis.",
        "Pandas and NumPy are essential libraries for data manipulation in Python.",
    ]

    # Tag each document with its list index so results can be mapped back
    tagged_docs = [
        TaggedDocument(words=word_tokenize(doc.lower()), tags=[str(i)])
        for i, doc in enumerate(documents)
    ]

    model = Doc2Vec(tagged_docs, vector_size=64, window=3, min_count=1, epochs=100, seed=42)

    # Infer vector for a new document
    new_doc = "BERT and GPT are both based on the transformer architecture."
    new_vector = model.infer_vector(word_tokenize(new_doc.lower()))

    # Find most similar training documents
    similar_docs = model.dv.most_similar([new_vector], topn=3)
    for tag, score in similar_docs:
        print(f"[{score:.4f}] {documents[int(tag)]}")

LDA Topic Modeling
LDA (Latent Dirichlet Allocation) discovers hidden topics in a document collection by modeling each document as a mixture of topics and each topic as a distribution over words.
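Gensim's LDA does not read raw text. It consumes each document as a bag of words, a list of `(token_id, count)` pairs, which is what `Dictionary.doc2bow` produces in the code that follows. A minimal plain-Python sketch of that representation is shown here; `build_vocab` and `to_bow` are hypothetical helpers rather than Gensim functions, and the ids Gensim assigns may differ.

```python
from collections import Counter


def build_vocab(tokenized_docs):
    """Assign an integer id to each distinct token, in order of first appearance."""
    vocab = {}
    for tokens in tokenized_docs:
        for token in tokens:
            vocab.setdefault(token, len(vocab))
    return vocab


def to_bow(tokens, vocab):
    """Return sorted (token_id, count) pairs, skipping tokens outside the vocabulary."""
    counts = Counter(vocab[t] for t in tokens if t in vocab)
    return sorted(counts.items())


docs = [
    ["neural", "networks", "update", "weights"],
    ["attention", "weights", "attention"],
]
vocab = build_vocab(docs)
bow = [to_bow(tokens, vocab) for tokens in docs]
print(bow)
```

Word order is discarded in this representation; only which words occur, and how often, reaches the topic model.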
    from gensim import corpora, models
    from gensim.utils import simple_preprocess
    from gensim.parsing.preprocessing import STOPWORDS

    documents = [
        "Machine learning algorithms train models on labeled data to make predictions.",
        "Neural networks use backpropagation to update weights during training.",
        "Gradient descent minimizes the loss function in deep learning models.",
        "Python is widely used for scientific computing and machine learning.",
        "NumPy provides efficient array operations for numerical computation.",
        "Data preprocessing is critical for improving model performance.",
        "Transformers process entire sequences simultaneously using attention.",
        "BERT learns contextual word representations through masked token prediction.",
        "GPT-4 uses a decoder-only transformer architecture for text generation.",
    ]

    def preprocess(text):
        # Lowercase and tokenize, then drop stopwords and very short tokens
        return [t for t in simple_preprocess(text) if t not in STOPWORDS and len(t) > 2]

    processed = [preprocess(doc) for doc in documents]
    dictionary = corpora.Dictionary(processed)
    dictionary.filter_extremes(no_below=1, no_above=0.9)

    corpus = [dictionary.doc2bow(doc) for doc in processed]

    lda_model = models.LdaModel(
        corpus=corpus,
        id2word=dictionary,
        num_topics=3,
        passes=20,
        alpha='auto',
        random_state=42,
    )

    print("LDA Topics:")
    for topic_id, topic in lda_model.print_topics(num_words=6):
        print(f"Topic {topic_id}: {topic}")

TF-IDF Similarity with Gensim
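TF-IDF scales each term count by how rare the term is across the collection, so a word that appears in every document contributes nothing to similarity. The sketch below computes the weights by hand, assuming the default `TfidfModel` scheme is raw term frequency multiplied by log2(N / document frequency), followed by L2 normalization per document. The helper names `tfidf_vectors` and `dot` are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(tokenized_docs):
    """Return one {term: weight} dict per document, L2-normalized."""
    n_docs = len(tokenized_docs)
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for tokens in tokenized_docs for term in set(tokens))
    vectors = []
    for tokens in tokenized_docs:
        counts = Counter(tokens)
        weights = {
            term: tf * math.log2(n_docs / doc_freq[term])
            for term, tf in counts.items()
        }
        # Drop zero weights (terms present in every document).
        weights = {t: w for t, w in weights.items() if w > 0}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({t: w / norm for t, w in weights.items()} if norm else {})
    return vectors


def dot(a, b):
    """Dot product of two sparse vectors; equals cosine similarity for unit vectors."""
    return sum(w * b.get(t, 0.0) for t, w in a.items())


docs = [
    ["the", "transformer", "architecture"],
    ["the", "attention", "mechanism"],
    ["the", "attention", "tokens"],
]
vectors = tfidf_vectors(docs)

print(round(dot(vectors[1], vectors[2]), 4), dot(vectors[0], vectors[1]))
```

The second and third documents share "attention" and so get a nonzero score, while the first shares only "the" with the others and scores zero. Gensim performs the same kind of comparison at scale through a similarity index.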
    from gensim import corpora, similarities
    from gensim.models import TfidfModel
    from gensim.utils import simple_preprocess

    documents = [
        "The transformer architecture enables parallel sequence processing.",
        "Attention mechanisms allow models to focus on relevant input tokens.",
        "BERT fine-tuning achieves strong results on classification tasks.",
        "Python data science tools include pandas, numpy, and scikit-learn.",
    ]

    corpus_tokens = [simple_preprocess(doc) for doc in documents]
    dictionary = corpora.Dictionary(corpus_tokens)
    bow_corpus = [dictionary.doc2bow(tokens) for tokens in corpus_tokens]

    tfidf = TfidfModel(bow_corpus)
    tfidf_corpus = tfidf[bow_corpus]

    index = similarities.SparseMatrixSimilarity(tfidf_corpus, num_features=len(dictionary))

    query = "How does the attention mechanism work?"
    query_bow = dictionary.doc2bow(simple_preprocess(query))
    query_tfidf = tfidf[query_bow]

    # Rank documents by cosine similarity to the query, highest first
    sims = sorted(enumerate(index[query_tfidf]), key=lambda x: -x[1])
    for doc_id, score in sims:
        print(f"[{score:.4f}] {documents[doc_id]}")

Gensim vs Alternatives
| Task | Gensim | sklearn | Hugging Face |
|---|---|---|---|
| Word2Vec training | Native, fast | No | No native trainer |
| Doc2Vec | Native | No | No |
| LDA topic modeling | Native | `LatentDirichletAllocation` (also NMF) | No |
| Large corpora | Streams from disk | Mostly in-memory | Model-dependent |
| Pre-trained embeddings | `gensim-data` via `gensim.downloader` | No | Large Hub |
| Transformer models | No | No | Excellent |
Gensim is the right choice when you need Word2Vec, Doc2Vec, or LDA and want memory-efficient processing of large document collections. For transformer-based embeddings, use Hugging Face instead.