Gensim

Gensim is a Python library for unsupervised topic modeling and document similarity. Its strengths are Word2Vec training, Doc2Vec training, and LDA topic modeling, all designed to handle large corpora efficiently by streaming documents one at a time rather than loading everything into RAM.
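
To make the streaming idea concrete, the sketch below defines a restartable iterable that reads one sentence per line from a text file and yields token lists lazily. The class name `LineCorpus`, the helper `train_from_file`, the hyperparameter values, and the whitespace tokenization are illustrative assumptions, not part of Gensim. Gensim ships a comparable helper, `gensim.models.word2vec.LineSentence`.

```python
from pathlib import Path


class LineCorpus:
    """Hypothetical helper: yields one tokenized sentence per line of a text file."""

    def __init__(self, path):
        self.path = Path(path)

    def __iter__(self):
        # Re-open the file on each pass: Word2Vec scans the corpus once to build
        # the vocabulary and then once per training epoch, so a one-shot
        # generator would be exhausted after the first pass.
        with self.path.open(encoding="utf-8") as handle:
            for line in handle:
                tokens = line.lower().split()  # naive whitespace tokenization
                if tokens:
                    yield tokens


def train_from_file(path):
    """Train Word2Vec without loading the whole corpus into memory."""
    # Imported here so LineCorpus itself has no Gensim dependency.
    from gensim.models import Word2Vec

    return Word2Vec(sentences=LineCorpus(path), vector_size=100, min_count=1, epochs=5)
```

The key design point is that the object is an iterable with its own `__iter__`, not a generator, so it can be traversed as many times as training requires.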

Installation

pip install gensim

Word2Vec — Train on Your Own Corpus

from gensim.models import Word2Vec
from nltk.tokenize import word_tokenize, sent_tokenize
import nltk
nltk.download('punkt_tab')

corpus_text = """
Natural language processing is a subfield of artificial intelligence.
Language models learn to predict text by training on massive corpora.
Word embeddings capture semantic relationships between words in continuous space.
BERT uses bidirectional transformers pretrained on masked language modeling.
GPT models use autoregressive attention to generate coherent text sequences.
"""

sentences = [word_tokenize(s.lower()) for s in sent_tokenize(corpus_text)]

model = Word2Vec(
    sentences,
    vector_size=128,   # embedding dimensions
    window=5,          # context window
    min_count=1,       # minimum word frequency
    sg=1,              # Skip-Gram (1) vs CBOW (0)
    epochs=100,
    seed=42
)

print("Vocabulary size:", len(model.wv))

# Semantic operations (on a corpus this small the neighbors are not meaningful;
# the calls only demonstrate the API)
print(model.wv.most_similar("language", topn=3))
print(f"Similarity (language, text): {model.wv.similarity('language', 'text'):.4f}")

# Save and load
model.save("word2vec.model")
loaded = Word2Vec.load("word2vec.model")
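
For intuition about `window` and `sg=1`, the sketch below enumerates the (center, context) pairs that a skip-gram model learns from. It is a conceptual illustration in plain Python, not Gensim's implementation: the function name `skipgram_pairs` is hypothetical, and refinements such as frequent-word subsampling and window-size sampling are left out.

```python
def skipgram_pairs(tokens, window):
    """Return (center, context) pairs for every token within `window` positions."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs


sentence = ["word", "embeddings", "capture", "semantic", "relationships"]
for center, context in skipgram_pairs(sentence, window=1):
    print(center, "->", context)
```

With CBOW (`sg=0`) the same windows are used, but the context words are combined to predict the center word instead of the other way around.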

Using Pre-Trained Word2Vec Embeddings

import gensim.downloader as api

# Available models
# print(list(api.info()['models'].keys()))

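# Load Google News Word2Vec (300-dim, trained on ~100B words).
# The first call downloads a large file (roughly 1.6 GB) and caches it locally.
# api.load returns a KeyedVectors object, so queries are called on it directly (no .wv).
model = api.load("word2vec-google-news-300")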

# Classic analogy
result = model.most_similar(positive=["king", "woman"], negative=["man"])
print("King - Man + Woman =", result[0])  # ('queen', 0.71...)

# Find odd one out
print(model.doesnt_match(["python", "java", "javascript", "broccoli"]))
# broccoli
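
The analogy query can be read as vector arithmetic followed by a nearest-neighbor search under cosine similarity. The sketch below reproduces that idea with hand-made 2-dimensional vectors. The vectors, the `analogy` helper, and the toy vocabulary are invented for illustration and are unrelated to the values in the Google News model.

```python
import math

# Toy vectors: axis 0 ~ "royalty", axis 1 ~ "gender" (made-up values)
vectors = {
    "king": [1.0, 1.0],
    "queen": [1.0, -1.0],
    "man": [0.0, 1.0],
    "woman": [0.0, -1.0],
    "prince": [0.9, 0.8],
}


def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine(a, b):
    """Cosine similarity: dot product of the unit-length vectors."""
    return sum(x * y for x, y in zip(unit(a), unit(b)))


def analogy(positive, negative):
    """Rank words by cosine similarity to sum(positive) - sum(negative)."""
    dim = len(next(iter(vectors.values())))
    query = [0.0] * dim
    for word in positive:
        query = [q + x for q, x in zip(query, unit(vectors[word]))]
    for word in negative:
        query = [q - x for q, x in zip(query, unit(vectors[word]))]
    # Exclude the query words themselves from the candidates
    candidates = [w for w in vectors if w not in positive and w not in negative]
    return sorted(candidates, key=lambda w: cosine(query, vectors[w]), reverse=True)


print(analogy(positive=["king", "woman"], negative=["man"]))
```

In this toy space the query vector points toward "royal and female", so "queen" ranks ahead of "prince".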

Doc2Vec — Document-Level Embeddings

Doc2Vec extends Word2Vec to produce a single vector for an entire document:

from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from nltk.tokenize import word_tokenize
import nltk
nltk.download('punkt_tab')

documents = [
    "Transformers use self-attention mechanisms to process sequences in parallel.",
    "BERT is pretrained using masked language modeling and next sentence prediction.",
    "GPT generates text autoregressively by predicting the next token.",
    "LLaMA and Mistral are open-source transformer models from Meta and Mistral AI.",
    "Python is a popular language for scientific computing and data analysis.",
    "Pandas and NumPy are essential libraries for data manipulation in Python."
]

tagged_docs = [
    TaggedDocument(words=word_tokenize(doc.lower()), tags=[str(i)])
    for i, doc in enumerate(documents)
]

model = Doc2Vec(tagged_docs, vector_size=64, window=3, min_count=1, epochs=100, seed=42)

# Infer vector for a new document
new_doc = "BERT and GPT are both based on the transformer architecture."
new_vector = model.infer_vector(word_tokenize(new_doc.lower()))

# Find most similar training documents
similar_docs = model.dv.most_similar(positive=[new_vector], topn=3)
for tag, score in similar_docs:
    print(f"[{score:.4f}] {documents[int(tag)]}")

LDA Topic Modeling

LDA (Latent Dirichlet Allocation) discovers hidden topics in a document collection:

from gensim import corpora, models
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS

documents = [
    "Machine learning algorithms train models on labeled data to make predictions.",
    "Neural networks use backpropagation to update weights during training.",
    "Gradient descent minimizes the loss function in deep learning models.",
    "Python is widely used for scientific computing and machine learning.",
    "NumPy provides efficient array operations for numerical computation.",
    "Data preprocessing is critical for improving model performance.",
    "Transformers process entire sequences simultaneously using attention.",
    "BERT learns contextual word representations through masked token prediction.",
    "GPT-4 uses a decoder-only transformer architecture for text generation.",
]

def preprocess(text):
    return [t for t in simple_preprocess(text) if t not in STOPWORDS and len(t) > 2]

processed = [preprocess(doc) for doc in documents]
dictionary = corpora.Dictionary(processed)
dictionary.filter_extremes(no_below=1, no_above=0.9)

corpus = [dictionary.doc2bow(doc) for doc in processed]

lda_model = models.LdaModel(
    corpus=corpus,
    id2word=dictionary,
    num_topics=3,
    passes=20,
    alpha='auto',
    random_state=42
)

print("LDA Topics:")
for topic_id, topic in lda_model.print_topics(num_words=6):
    print(f"Topic {topic_id}: {topic}")
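
LDA never sees raw text: it consumes the bag-of-words vectors produced by `doc2bow`, which are lists of `(token_id, count)` pairs. The plain-Python sketch below mimics that representation to show what the model receives. `build_vocab` and `to_bow` are hypothetical stand-ins, and the ids they assign will not necessarily match those of `corpora.Dictionary`.

```python
from collections import Counter


def build_vocab(tokenized_docs):
    """Assign an integer id to each distinct token, in order of first appearance."""
    vocab = {}
    for doc in tokenized_docs:
        for token in doc:
            vocab.setdefault(token, len(vocab))
    return vocab


def to_bow(tokens, vocab):
    """Convert tokens to sorted (token_id, count) pairs, skipping unknown tokens."""
    counts = Counter(vocab[t] for t in tokens if t in vocab)
    return sorted(counts.items())


docs = [["neural", "networks", "training"], ["training", "data", "training"]]
vocab = build_vocab(docs)
print(vocab)
print([to_bow(doc, vocab) for doc in docs])
```

Word order is discarded at this step, which is why LDA is described as a bag-of-words model: only which tokens occur, and how often, reaches the topic inference.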

TF-IDF Similarity with Gensim

from gensim import corpora, similarities
from gensim.models import TfidfModel
from gensim.utils import simple_preprocess

documents = [
    "The transformer architecture enables parallel sequence processing.",
    "Attention mechanisms allow models to focus on relevant input tokens.",
    "BERT fine-tuning achieves strong results on classification tasks.",
    "Python data science tools include pandas, numpy, and scikit-learn.",
]

corpus_tokens = [simple_preprocess(doc) for doc in documents]
dictionary = corpora.Dictionary(corpus_tokens)
bow_corpus = [dictionary.doc2bow(tokens) for tokens in corpus_tokens]

tfidf = TfidfModel(bow_corpus)
tfidf_corpus = tfidf[bow_corpus]

index = similarities.SparseMatrixSimilarity(tfidf_corpus, num_features=len(dictionary))

query = "How does the attention mechanism work?"
query_bow = dictionary.doc2bow(simple_preprocess(query))
query_tfidf = tfidf[query_bow]

sims = sorted(enumerate(index[query_tfidf]), key=lambda x: -x[1])
for doc_id, score in sims:
    print(f"[{score:.4f}] {documents[doc_id]}")
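
To show what the TF-IDF step contributes to the similarity scores, the sketch below weights raw counts by log2(N / df) and scales each document to unit length, so that cosine similarity reduces to a dot product. This follows what is understood to be the default weighting of `TfidfModel`, stated here as an assumption rather than a guarantee. The function names are hypothetical and the code is a simplification, not Gensim's implementation.

```python
import math
from collections import Counter


def tfidf_vectors(tokenized_docs):
    """Return one {token: weight} dict per document, each scaled to unit length."""
    n_docs = len(tokenized_docs)
    doc_freq = Counter(token for doc in tokenized_docs for token in set(doc))
    vectors = []
    for doc in tokenized_docs:
        counts = Counter(doc)
        weights = {t: c * math.log2(n_docs / doc_freq[t]) for t, c in counts.items()}
        # Tokens present in every document get weight 0 and are dropped
        weights = {t: w for t, w in weights.items() if w > 0}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({t: w / norm for t, w in weights.items()} if norm else {})
    return vectors


def cosine(a, b):
    """Dot product of two unit-length sparse vectors."""
    return sum(w * b.get(t, 0.0) for t, w in a.items())


docs = [
    ["the", "attention", "mechanism"],
    ["the", "attention", "tokens"],
    ["the", "python", "tools"],
]
vecs = tfidf_vectors(docs)
print(cosine(vecs[0], vecs[1]), cosine(vecs[0], vecs[2]))
```

Because "the" appears in every document, its weight is zero and it contributes nothing to the score; the first two documents match only through the rarer token "attention".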

Gensim vs Alternatives

Task	Gensim	scikit-learn	Hugging Face
Word2Vec training	Native, fast	No	Not native; some Gensim-format vectors are hosted on the Hub
Doc2Vec	Native	No	No
LDA topic modeling	Native	Yes (`LatentDirichletAllocation`), plus NMF	No
Large corpora	Streaming from disk	Mostly in-memory (some estimators support `partial_fit`)	Model-dependent
Pre-trained embeddings	Via `gensim.downloader`	No	Large Hub
Transformer models	No	No	Excellent

Gensim is the right choice when you need Word2Vec, Doc2Vec, or LDA and want memory-efficient processing of large document collections. For transformer-based embeddings, use Hugging Face instead.