Word Embeddings in NLP

Word embeddings map words to dense numeric vectors in which semantically similar words have similar vectors. “King” and “Queen” end up close together; “King” and “Broccoli” end up far apart. This geometric representation of meaning underlies most of modern NLP.

Why Dense Vectors?

Bag-of-words gives each word a one-hot vector: a sparse, vocabulary-sized array with a single 1 at the word's index. Any two distinct one-hot vectors are orthogonal, so they encode no relationship between words; “happy” and “joyful” are exactly as far apart as “happy” and “airplane”.

Word embeddings solve this by placing words in a continuous vector space where distance reflects semantic similarity.
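The contrast can be checked numerically. The sketch below uses only the standard library; the dense vectors are invented for illustration and are not learned from any corpus.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

vocab = ["happy", "joyful", "airplane"]

# One-hot: a single 1 at the word's index in the vocabulary
one_hot = {w: [1.0 if i == j else 0.0 for j in range(len(vocab))]
           for i, w in enumerate(vocab)}

# Dense vectors: values are made up for this example, not trained
dense = {
    "happy":    [0.9, 0.8, 0.1],
    "joyful":   [0.8, 0.9, 0.2],
    "airplane": [0.1, 0.0, 0.9],
}

print(cosine(one_hot["happy"], one_hot["joyful"]))    # 0.0
print(cosine(one_hot["happy"], one_hot["airplane"]))  # 0.0

# With dense vectors, the related pair scores higher than the unrelated pair
print(cosine(dense["happy"], dense["joyful"]) >
      cosine(dense["happy"], dense["airplane"]))      # True
```

Every pair of distinct one-hot vectors scores exactly 0, whereas dense vectors leave room for graded similarity.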

Word2Vec

Word2Vec (Google, 2013) trains on word context: either predict the surrounding words from a center word (Skip-Gram) or predict a center word from surrounding words (CBOW).

from gensim.models import Word2Vec
from nltk.tokenize import word_tokenize, sent_tokenize
import nltk
nltk.download('punkt_tab')

corpus = """
Natural language processing enables computers to understand human language.
Word embeddings capture semantic relationships between words.
Deep learning has revolutionized language understanding and text generation.
Transformer models like BERT and GPT use attention mechanisms.
Language models are trained on massive text corpora.
"""

sentences = [word_tokenize(s.lower()) for s in sent_tokenize(corpus)]

model = Word2Vec(
    sentences,
    vector_size=100,  # embedding dimension
    window=5,         # context window size
    min_count=1,      # drop words that occur fewer than this many times
    sg=1,             # 1 = Skip-Gram, 0 = CBOW
    epochs=100,
    seed=42           # fully reproducible only with workers=1 and a fixed PYTHONHASHSEED
)

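# Find most similar words (returns (word, cosine similarity) pairs)
print(model.wv.most_similar("language", topn=3))
# e.g. [('human', 0.98), ('natural', 0.97), ('understanding', 0.96)]
# Exact neighbors and scores vary; on a corpus this small they are not meaningful.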

# Word arithmetic
result = model.wv.most_similar(positive=['language', 'deep'], negative=['natural'], topn=1)
print(result)

# Vector for a word
vector = model.wv["language"]
print(f"Vector shape: {vector.shape}")  # (100,)
print(f"First 5 values: {vector[:5]}")
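To make the Skip-Gram and CBOW objectives described above more concrete, here is a minimal sketch of how training examples could be built from a context window. The function names `skipgram_pairs` and `cbow_pairs` are hypothetical helpers, not part of gensim, and the sketch leaves out details of real trainers such as subsampling and negative sampling.

```python
def skipgram_pairs(tokens, window=2):
    """Yield (center, context) pairs: the center word predicts each neighbor."""
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                yield center, tokens[j]

def cbow_pairs(tokens, window=2):
    """Yield (context_words, center) pairs: the neighbors jointly predict the center."""
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        yield context, center

tokens = "word embeddings capture semantic relationships".split()

print(list(skipgram_pairs(tokens, window=1))[:3])
# [('word', 'embeddings'), ('embeddings', 'word'), ('embeddings', 'capture')]

print(list(cbow_pairs(tokens, window=1))[1])
# (['word', 'capture'], 'embeddings')
```

Skip-Gram produces one example per (center, neighbor) pair, while CBOW produces one example per position.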

Using Pre-Trained Word2Vec (Google News)

import gensim.downloader as api

# Download pre-trained model (~1.6 GB)
model = api.load("word2vec-google-news-300")

# Classic analogy: king - man + woman = queen
result = model.most_similar(positive=["king", "woman"], negative=["man"])
print(result[0])  # ('queen', 0.7118...)

# Semantic similarity
print(model.similarity("python", "programming"))  # high score
print(model.similarity("python", "broccoli"))     # low score
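The analogy query above is vector arithmetic followed by a nearest-neighbor lookup. The following sketch reproduces the idea with hand-made three-dimensional vectors (invented for illustration, with dimensions loosely standing for royalty, gender, and food). The helper `analogy` is hypothetical; it roughly mirrors what `most_similar` does by normalizing the input vectors and excluding the query words from the candidates.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine(u, v):
    u, v = normalize(u), normalize(v)
    return sum(a * b for a, b in zip(u, v))

# Toy vectors, made up for this example: (royalty, gender, food)
vectors = {
    "king":     [1.0,  1.0, 0.0],
    "queen":    [1.0, -1.0, 0.0],
    "man":      [0.0,  1.0, 0.0],
    "woman":    [0.0, -1.0, 0.0],
    "broccoli": [0.0,  0.0, 1.0],
}

def analogy(positive, negative, vectors):
    """Return the word closest to sum(positive) - sum(negative), excluding the inputs."""
    dim = len(next(iter(vectors.values())))
    target = [0.0] * dim
    for word in positive:
        target = [t + x for t, x in zip(target, normalize(vectors[word]))]
    for word in negative:
        target = [t - x for t, x in zip(target, normalize(vectors[word]))]
    candidates = [w for w in vectors if w not in positive and w not in negative]
    return max(candidates, key=lambda w: cosine(target, vectors[w]))

print(analogy(positive=["king", "woman"], negative=["man"], vectors=vectors))  # queen
```

Excluding the input words matters: without that step the nearest neighbor of the target is often one of the query words themselves.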

GloVe Embeddings

GloVe (Global Vectors for Word Representation, Stanford) fits word vectors to a word-word co-occurrence matrix aggregated over the entire corpus, so it learns from global statistics rather than from one local context window at a time.

import numpy as np

def load_glove(glove_file):
    embeddings = {}
    with open(glove_file, 'r', encoding='utf-8') as f:
        for line in f:
            values = line.split()
            word = values[0]
            vector = np.array(values[1:], dtype='float32')
            embeddings[word] = vector
    return embeddings

# Download from: https://nlp.stanford.edu/projects/glove/
# glove = load_glove("glove.6B.100d.txt")

# Using via gensim
import gensim.downloader as api

glove_model = api.load("glove-wiki-gigaword-100")
print(glove_model.most_similar("transformer", topn=3))
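GloVe's training signal is a table of co-occurrence counts gathered over the whole corpus. The sketch below builds such a table for a toy corpus. The function `cooccurrence_counts` is a hypothetical helper, and the 1/distance weighting follows the scheme described for GloVe, in which nearer word pairs contribute more. Fitting vectors to these counts is not shown.

```python
from collections import defaultdict

def cooccurrence_counts(sentences, window=2):
    """Accumulate symmetric co-occurrence weights over the whole corpus.

    Pairs that are d tokens apart contribute 1/d, so nearer words count more.
    """
    counts = defaultdict(float)
    for tokens in sentences:
        for i, word in enumerate(tokens):
            for j in range(i + 1, min(len(tokens), i + window + 1)):
                weight = 1.0 / (j - i)
                counts[(word, tokens[j])] += weight
                counts[(tokens[j], word)] += weight
    return counts

sentences = [
    "language models learn from text".split(),
    "language models generate text".split(),
]

counts = cooccurrence_counts(sentences, window=2)
print(counts[("language", "models")])  # 2.0 (adjacent in both sentences)
print(counts[("models", "text")])      # 0.5 (two tokens apart, second sentence only)
```

Because the counts are summed across all sentences, a pair that appears often anywhere in the corpus gets a large entry regardless of which sentence it came from.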

Contextual Embeddings from Transformers

Static embeddings like Word2Vec give the same vector for “bank” in every context. Contextual embeddings (BERT, RoBERTa, GPT) produce a different vector depending on the surrounding words.

from transformers import AutoTokenizer, AutoModel
import torch

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def get_contextual_embedding(text, word):
    """Return the last-layer vector for the first token matching `word`."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)

    # Locate the word among the tokens ([CLS] sits at position 0).
    # Assumes `word` is a single WordPiece token, which holds for "bank".
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
    token_index = tokens.index(word)

    # Last hidden state: (batch, seq_len, 768)
    return outputs.last_hidden_state[0, token_index]

# "bank" in different contexts → different vectors
text1 = "She deposited money at the bank."
text2 = "They rested on the river bank."

emb1 = get_contextual_embedding(text1, "bank")
emb2 = get_contextual_embedding(text2, "bank")

similarity = torch.nn.functional.cosine_similarity(emb1.unsqueeze(0), emb2.unsqueeze(0))
print(f"Context similarity for 'bank': {similarity.item():.4f}")
# Below 1.0: the two occurrences of "bank" get different vectors
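BERT's tokenizer can split a single word into several WordPiece tokens, so a word's position in the sentence does not always match its token position. One common convention, assumed here, is to average the vectors of a word's pieces. The sketch below uses a hand-written token list and stand-in two-dimensional vectors rather than real tokenizer or model output; `group_wordpieces` and `mean_pool` are hypothetical helpers.

```python
def group_wordpieces(tokens):
    """Map each word to the positions of its WordPiece tokens.

    Assumes BERT-style tokens: continuation pieces start with '##',
    and special tokens such as [CLS] and [SEP] are skipped.
    """
    groups = []
    for pos, tok in enumerate(tokens):
        if tok in ("[CLS]", "[SEP]", "[PAD]"):
            continue
        if tok.startswith("##") and groups:
            groups[-1][0] += tok[2:]      # extend the current word
            groups[-1][1].append(pos)
        else:
            groups.append([tok, [pos]])   # start a new word
    return [(word, positions) for word, positions in groups]

def mean_pool(vectors, positions):
    """Average the vectors at the given token positions."""
    dim = len(vectors[0])
    return [sum(vectors[p][d] for p in positions) / len(positions) for d in range(dim)]

# Hypothetical tokenization, written by hand rather than produced by a real tokenizer
tokens = ["[CLS]", "river", "em", "##bank", "##ment", "[SEP]"]
# Stand-in 2-D hidden states, one per token (real BERT-base states have 768 dimensions)
hidden = [[0.0, 0.0], [1.0, 1.0], [2.0, 0.0], [4.0, 2.0], [6.0, 4.0], [0.0, 0.0]]

words = group_wordpieces(tokens)
print(words)                           # [('river', [1]), ('embankment', [2, 3, 4])]
print(mean_pool(hidden, words[1][1]))  # [4.0, 2.0]
```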

Comparing Embedding Approaches

Model	Context-aware	Dimension	Vocabulary	Best For
Word2Vec	No	100–300	Fixed	Fast similarity, analogies
GloVe	No	50–300	Fixed	Same as Word2Vec
FastText	No	100–300	Subword	Rare words, morphology
BERT	Yes	768 (base)	WordPiece	Disambiguation, fine-tuning
sentence-transformers	Yes	384–1024	Subword (depends on base model)	Semantic search, RAG
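The "Subword" entry for FastText refers to character n-grams: a word's vector is built from the vectors of its n-grams, so rare or unseen words still get a representation. A minimal sketch of the n-gram extraction follows; `char_ngrams` is a hypothetical helper, not the FastText API. It uses `<` and `>` as word-boundary markers and defaults to n-gram lengths of 3 to 6.

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Character n-grams of a word wrapped in '<' and '>' boundary markers."""
    wrapped = f"<{word}>"
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(wrapped) - n + 1):
            grams.append(wrapped[i:i + n])
    return grams

print(char_ngrams("where", n_min=3, n_max=3))
# ['<wh', 'whe', 'her', 'ere', 're>']
```

Two words that share n-grams, such as a word and its plural, share part of their representation even if one of them never appeared in training.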

Practical Applications

Semantic search — find documents similar in meaning, not just matching keywords.

Sentiment analysis — encode reviews as averaged word vectors, then classify.

Document clustering — group articles by topic using embedding similarity.

RAG (Retrieval-Augmented Generation) — store chunk embeddings in a vector database, retrieve by cosine similarity.

Transfer learning — initialize neural network weights with pre-trained embeddings, then fine-tune.
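Several of these applications share one pattern: embed each text, then compare embeddings by cosine similarity. The sketch below averages word vectors to embed short texts and ranks documents against a query. The word vectors are toy values invented for illustration; in practice they would come from a trained model such as the Word2Vec or GloVe vectors loaded earlier. The helpers `embed` and `search` are hypothetical.

```python
import math

# Toy word vectors, made up for this example
word_vectors = {
    "cat":    [0.9, 0.1, 0.0],
    "kitten": [0.8, 0.2, 0.0],
    "pet":    [0.7, 0.3, 0.1],
    "stock":  [0.0, 0.9, 0.2],
    "market": [0.1, 0.8, 0.3],
    "price":  [0.1, 0.9, 0.1],
}

def embed(text):
    """Average the vectors of known words; unknown words are skipped."""
    vecs = [word_vectors[w] for w in text.lower().split() if w in word_vectors]
    if not vecs:
        raise ValueError(f"no known words in: {text!r}")
    return [sum(col) / len(vecs) for col in zip(*vecs)]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def search(query, documents, top_k=1):
    """Rank documents by cosine similarity to the query embedding."""
    q = embed(query)
    scored = [(cosine(q, embed(doc)), doc) for doc in documents]
    return sorted(scored, reverse=True)[:top_k]

documents = ["stock market price", "cat pet"]
print(search("kitten", documents)[0][1])  # cat pet
```

The query "kitten" shares no keyword with "cat pet", yet that document ranks first because the vectors are close. The same averaged vectors could be fed to a classifier for sentiment analysis or to a clustering algorithm.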

In 2025, sentence-level embeddings from sentence-transformers have largely replaced raw word embeddings for most downstream tasks. But understanding word embeddings is foundational — they introduced the idea that meaning can be captured geometrically.