Word Embeddings in NLP
Word embeddings map words to dense numeric vectors where semantically similar words have similar vectors. “King” and “Queen” end up close together; “King” and “Broccoli” end up far apart. This geometric representation of meaning underpins much of modern NLP.
Why Dense Vectors?
Bag-of-words representations are built on one-hot vectors: each word gets a sparse vector with a single 1 in a vocabulary-sized array. Any two distinct one-hot vectors are orthogonal, so they carry no notion of relatedness; “happy” and “joyful” are exactly as far apart as “happy” and “airplane”.
Word embeddings solve this by placing words in a continuous vector space where distance reflects semantic similarity.
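To make the contrast concrete, here is a minimal sketch using NumPy. The three-word vocabulary and the 3-dimensional dense vectors are made up for illustration (they are not trained embeddings), and `cosine` is a helper defined here rather than a library function.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

vocab = ["happy", "joyful", "airplane"]

# One-hot: row i of the identity matrix is the vector for vocab[i]
one_hot = {word: np.eye(len(vocab))[i] for i, word in enumerate(vocab)}

# Hand-picked dense vectors (illustrative values, not trained embeddings)
dense = {
    "happy":    np.array([0.9, 0.8, 0.1]),
    "joyful":   np.array([0.8, 0.9, 0.2]),
    "airplane": np.array([0.1, 0.2, 0.9]),
}

for name, vectors in [("one-hot", one_hot), ("dense", dense)]:
    sim_synonym = cosine(vectors["happy"], vectors["joyful"])
    sim_unrelated = cosine(vectors["happy"], vectors["airplane"])
    print(f"{name}: happy~joyful={sim_synonym:.2f}, happy~airplane={sim_unrelated:.2f}")
```

With one-hot vectors both similarities are zero, while the dense vectors score the synonym pair well above the unrelated pair.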
Word2Vec
Word2Vec (Google, 2013) learns from word context: it either predicts the surrounding words from a center word (Skip-Gram) or predicts a center word from its surrounding words (CBOW).
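The difference between the two objectives is easiest to see in the training examples each one generates. The sketch below is a simplified illustration, not gensim's internals: `skipgram_pairs` and `cbow_examples` are hypothetical helper names, and the real algorithm adds refinements (such as negative sampling and frequent-word subsampling) that are omitted here.

```python
def skipgram_pairs(tokens, window=2):
    """Return (center, context) pairs: each center word predicts its neighbours."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

def cbow_examples(tokens, window=2):
    """Return (context_words, center) examples: the neighbours predict the center."""
    examples = []
    for i, center in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + window + 1]
        examples.append((context, center))
    return examples

tokens = ["word", "embeddings", "capture", "semantic", "relationships"]

print(skipgram_pairs(tokens, window=1)[:4])
# [('word', 'embeddings'), ('embeddings', 'word'),
#  ('embeddings', 'capture'), ('capture', 'embeddings')]

print(cbow_examples(tokens, window=1)[1])
# (['word', 'capture'], 'embeddings')
```

Skip-Gram produces one example per (center, neighbour) pair, whereas CBOW produces one example per position with all neighbours grouped together.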
    from gensim.models import Word2Vec
    from nltk.tokenize import word_tokenize, sent_tokenize
    import nltk

    nltk.download('punkt_tab')

    corpus = """Natural language processing enables computers to understand human language.
    Word embeddings capture semantic relationships between words.
    Deep learning has revolutionized language understanding and text generation.
    Transformer models like BERT and GPT use attention mechanisms.
    Language models are trained on massive text corpora."""

    # One lowercased token list per sentence
    sentences = [word_tokenize(s.lower()) for s in sent_tokenize(corpus)]

    model = Word2Vec(
        sentences,
        vector_size=100,  # embedding dimension
        window=5,         # context window size
        min_count=1,      # ignore words with fewer occurrences than this
        sg=1,             # 1 = Skip-Gram, 0 = CBOW
        epochs=100,
        seed=42,
    )

    # Find most similar words
    print(model.wv.most_similar("language", topn=3))
    # Example output (exact neighbours and scores vary on a corpus this small):
    # [('human', 0.98), ('natural', 0.97), ('understanding', 0.96)]

    # Word arithmetic
    result = model.wv.most_similar(positive=['language', 'deep'], negative=['natural'], topn=1)
    print(result)

    # Vector for a word
    vector = model.wv["language"]
    print(f"Vector shape: {vector.shape}")  # (100,)
    print(f"First 5 values: {vector[:5]}")

Using Pre-Trained Word2Vec (Google News)
    import gensim.downloader as api

    # Download pre-trained model (~1.6 GB)
    model = api.load("word2vec-google-news-300")

    # Classic analogy: king - man + woman = queen
    result = model.most_similar(positive=["king", "woman"], negative=["man"])
    print(result[0])  # ('queen', 0.7118...)

    # Semantic similarity
    print(model.similarity("python", "programming"))  # high score
    print(model.similarity("python", "broccoli"))     # low score

GloVe Embeddings
GloVe (Global Vectors for Word Representation, Stanford) trains on word co-occurrence statistics across the entire corpus — capturing global context rather than local windows.
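As a rough illustration of what "co-occurrence statistics" means, the sketch below counts how often word pairs appear near each other across a whole toy corpus, weighting a pair at distance d by 1/d. It only covers the counting step; GloVe then fits vectors whose dot products approximate the logarithm of these counts, which is not shown. The function name `cooccurrence_counts` and the two toy sentences are assumptions made for this example.

```python
from collections import defaultdict

def cooccurrence_counts(sentences, window=2):
    """Accumulate symmetric co-occurrence weights over all sentences.

    A pair of words at distance d contributes 1/d, so nearer words count more.
    """
    counts = defaultdict(float)
    for tokens in sentences:
        for i, word in enumerate(tokens):
            # Look back at most `window` positions; add both directions
            for j in range(max(0, i - window), i):
                weight = 1.0 / (i - j)
                counts[(word, tokens[j])] += weight
                counts[(tokens[j], word)] += weight
    return counts

sentences = [
    ["language", "models", "learn", "language"],
    ["models", "learn", "patterns"],
]
counts = cooccurrence_counts(sentences, window=2)

print(counts[("models", "learn")])    # 2.0 (adjacent in both sentences)
print(counts[("language", "learn")])  # 1.5 (distance 2 once, distance 1 once)
```

Because the counts are aggregated over every sentence before any training happens, each pair's statistic reflects the whole corpus rather than a single window.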
    import numpy as np
    import gensim.downloader as api

    def load_glove(glove_file):
        """Parse a GloVe text file into a {word: vector} dictionary."""
        embeddings = {}
        with open(glove_file, 'r', encoding='utf-8') as f:
            for line in f:
                # Each line is: word followed by its vector components
                values = line.split()
                word = values[0]
                vector = np.array(values[1:], dtype='float32')
                embeddings[word] = vector
        return embeddings

    # Download from: https://nlp.stanford.edu/projects/glove/
    # glove = load_glove("glove.6B.100d.txt")

    # Alternatively, load via gensim's downloader
    glove_model = api.load("glove-wiki-gigaword-100")
    print(glove_model.most_similar("transformer", topn=3))

Contextual Embeddings from Transformers
Static embeddings like Word2Vec give the same vector for “bank” in every context. Contextual embeddings (BERT, RoBERTa, GPT) produce a different vector depending on the surrounding words.
    from transformers import AutoTokenizer, AutoModel
    import torch

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = AutoModel.from_pretrained("bert-base-uncased")

    def get_contextual_embedding(text, word):
        inputs = tokenizer(text, return_tensors="pt")
        with torch.no_grad():
            outputs = model(**inputs)

        # Last hidden state: (batch, seq_len, 768)
        embeddings = outputs.last_hidden_state[0]

        # Locate the word among the tokens ([CLS] sits at position 0).
        # Assumes the word is a single WordPiece token, as "bank" is here.
        tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
        return embeddings[tokens.index(word)]

    # "bank" in different contexts → different vectors
    text1 = "She deposited money at the bank."
    text2 = "They rested on the river bank."

    emb1 = get_contextual_embedding(text1, "bank")
    emb2 = get_contextual_embedding(text2, "bank")

    similarity = torch.nn.functional.cosine_similarity(emb1.unsqueeze(0), emb2.unsqueeze(0))
    print(f"Context similarity for 'bank': {similarity.item():.4f}")
    # Below 1.0: the two contexts produce different vectors for the same word

Comparing Embedding Approaches
| Model | Context-aware | Dimension | Vocabulary | Best For |
|---|---|---|---|---|
| Word2Vec | No | 100–300 | Fixed | Fast similarity, analogies |
| GloVe | No | 50–300 | Fixed | Same as Word2Vec |
| FastText | No | 100–300 | Subword | Rare words, morphology |
| BERT | Yes | 768 (base) | WordPiece | Disambiguation, fine-tuning |
| sentence-transformers | Yes | 384–1024 | Subword (model-dependent) | Semantic search, RAG |
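The "Subword" entry for FastText refers to character n-grams: a word is wrapped in boundary markers, broken into overlapping character sequences, and its vector is built from the vectors of those pieces. That is why a rare or unseen word can still get a sensible vector. The sketch below shows only the n-gram extraction; `char_ngrams` is a hypothetical helper written for this example, not the FastText library's API.

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Return the character n-grams of a word wrapped in < > boundary markers."""
    wrapped = f"<{word}>"
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(wrapped) - n + 1):
            grams.append(wrapped[i:i + n])
    return grams

print(char_ngrams("where", n_min=3, n_max=3))
# ['<wh', 'whe', 'her', 'ere', 're>']
```

Words that share pieces, such as "where" and "wherever", share some n-grams and therefore end up with related vectors.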
Practical Applications
Semantic search — find documents similar in meaning, not just matching keywords.
Sentiment analysis — encode reviews as averaged word vectors, then classify.
Document clustering — group articles by topic using embedding similarity.
RAG (Retrieval-Augmented Generation) — store chunk embeddings in a vector database, retrieve by cosine similarity.
Transfer learning — initialize neural network weights with pre-trained embeddings, then fine-tune.
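Several of these applications share one pattern: turn each text into a single vector, then compare vectors by cosine similarity. Below is a minimal sketch of that pattern using averaged word vectors. The 3-dimensional `word_vectors` are made-up values standing in for a trained model, and `embed` and `rank_by_cosine` are hypothetical helpers; a real system would use trained embeddings and typically a vector database instead of a Python list.

```python
import numpy as np

DIM = 3  # toy dimensionality for this illustration

# Made-up word vectors standing in for a trained embedding model
word_vectors = {
    "great":    np.array([0.9, 0.1, 0.0]),
    "loved":    np.array([0.8, 0.2, 0.1]),
    "terrible": np.array([-0.9, 0.1, 0.0]),
    "movie":    np.array([0.2, 0.8, 0.1]),
    "film":     np.array([0.25, 0.75, 0.1]),
}

def embed(text):
    """Average the vectors of known words; unknown words are skipped."""
    vectors = [word_vectors[w] for w in text.lower().split() if w in word_vectors]
    if not vectors:
        return np.zeros(DIM)
    return np.mean(vectors, axis=0)

def rank_by_cosine(query, documents):
    """Return (score, document) tuples sorted by cosine similarity, best first."""
    q = embed(query)
    scored = []
    for doc in documents:
        d = embed(doc)
        denom = np.linalg.norm(q) * np.linalg.norm(d)
        score = float(q @ d / denom) if denom else 0.0
        scored.append((score, doc))
    return sorted(scored, reverse=True)

documents = ["terrible movie", "loved film"]
ranked = rank_by_cosine("great movie", documents)
print(ranked[0][1])  # loved film
```

The query shares the keyword "movie" with the first document but is ranked closer to the second, which is the difference between matching meaning and matching keywords.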
In 2025, sentence-level embeddings from sentence-transformers have largely replaced raw word embeddings for most downstream tasks. But understanding word embeddings is foundational — they introduced the idea that meaning can be captured geometrically.