

From tokenisation and embeddings to transformer-based language understanding — the NLP fundamentals that underpin every modern LLM.

Word Embeddings in NLP

Word embeddings map words to dense numeric vectors where semantically similar words have similar vectors. “King” and “Queen” end up close together; “King” and “Broccoli” end up far apart. This geometric representation of meaning is what powers modern NLP.


Why Dense Vectors?

One-hot encoding, the basis of bag-of-words models, gives each word a sparse vector: a vocabulary-sized array with a single 1. Any two distinct one-hot vectors are orthogonal, so they carry no notion of similarity; “happy” and “joyful” are exactly as far apart as “happy” and “airplane”.

Word embeddings solve this by placing words in a continuous vector space where distance reflects semantic similarity.
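
The contrast can be checked numerically. The sketch below is a minimal illustration, not a trained model: the three-dimensional dense vectors are made up for the example, and `cosine` is a hypothetical helper written here in plain Python.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

vocab = ["happy", "joyful", "airplane"]

# One-hot: a vocabulary-sized vector with a single 1 per word.
one_hot = {
    word: [1.0 if i == j else 0.0 for j in range(len(vocab))]
    for i, word in enumerate(vocab)
}

# Hypothetical dense vectors, invented for illustration only.
dense = {
    "happy":    [0.9, 0.8, 0.1],
    "joyful":   [0.8, 0.9, 0.2],
    "airplane": [0.1, 0.2, 0.9],
}

one_hot_synonym = cosine(one_hot["happy"], one_hot["joyful"])
one_hot_unrelated = cosine(one_hot["happy"], one_hot["airplane"])
dense_synonym = cosine(dense["happy"], dense["joyful"])
dense_unrelated = cosine(dense["happy"], dense["airplane"])

print(one_hot_synonym, one_hot_unrelated)  # both 0.0: no similarity signal
print(dense_synonym > dense_unrelated)     # True: distance now carries meaning
```

With one-hot vectors every pair of distinct words scores zero, so a model gets no hint that two words are related. Dense vectors leave room for "happy" and "joyful" to point in nearly the same direction.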


Word2Vec

Word2Vec (Google, 2013) trains on word context: either predict the surrounding words from a center word (Skip-Gram) or predict a center word from surrounding words (CBOW).
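
To make the two objectives concrete, the sketch below builds the training examples each one would see for a single sentence. It is a simplified illustration: `skipgram_pairs` and `cbow_examples` are hypothetical helpers, and real Word2Vec implementations add details left out here, such as negative sampling and frequent-word subsampling.

```python
def skipgram_pairs(tokens, window=2):
    """Skip-Gram: the center word predicts each neighbour.
    Returns a list of (center, context) pairs."""
    pairs = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

def cbow_examples(tokens, window=2):
    """CBOW: the neighbours jointly predict the center word.
    Returns a list of (context_words, center) examples."""
    examples = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        examples.append((context, center))
    return examples

tokens = ["word", "embeddings", "capture", "semantic", "relationships"]
sg_pairs = skipgram_pairs(tokens, window=1)
cbow = cbow_examples(tokens, window=1)

print(sg_pairs[:3])
# [('word', 'embeddings'), ('embeddings', 'word'), ('embeddings', 'capture')]
print(cbow[2])
# (['embeddings', 'semantic'], 'capture')
```

Skip-Gram produces one training pair per neighbour, while CBOW produces one example per position with the whole context as input.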

import nltk
from gensim.models import Word2Vec
from nltk.tokenize import sent_tokenize, word_tokenize

# Tokenizer data required by sent_tokenize / word_tokenize
nltk.download('punkt_tab')

corpus = """
Natural language processing enables computers to understand human language.
Word embeddings capture semantic relationships between words.
Deep learning has revolutionized language understanding and text generation.
Transformer models like BERT and GPT use attention mechanisms.
Language models are trained on massive text corpora.
"""

# One list of lowercase tokens per sentence
sentences = [word_tokenize(s.lower()) for s in sent_tokenize(corpus)]

model = Word2Vec(
    sentences,
    vector_size=100,  # embedding dimension
    window=5,         # context window size
    min_count=1,      # drop words that occur fewer than this many times
    sg=1,             # 1 = Skip-Gram, 0 = CBOW
    epochs=100,
    seed=42,
)

# Find most similar words: a list of (word, cosine similarity) pairs
print(model.wv.most_similar("language", topn=3))
# e.g. [('human', 0.98), ('natural', 0.97), ('understanding', 0.96)]
# Exact values vary by run; on a corpus this small the neighbours are mostly noise.

# Word arithmetic: language + deep - natural
result = model.wv.most_similar(positive=['language', 'deep'], negative=['natural'], topn=1)
print(result)

# Vector for a word
vector = model.wv["language"]
print(f"Vector shape: {vector.shape}")  # (100,)
print(f"First 5 values: {vector[:5]}")

Using Pre-Trained Word2Vec (Google News)

import gensim.downloader as api

# Download pre-trained vectors (~1.6 GB on first use, then cached locally)
# api.load returns KeyedVectors, so there is no .wv attribute here
model = api.load("word2vec-google-news-300")

# Classic analogy: king - man + woman ≈ queen
result = model.most_similar(positive=["king", "woman"], negative=["man"])
print(result[0])  # ('queen', 0.7118...)

# Semantic similarity (cosine, between -1 and 1)
print(model.similarity("python", "programming"))  # higher score
print(model.similarity("python", "broccoli"))     # lower score
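
The analogy query is plain vector arithmetic followed by a nearest-neighbour lookup. The sketch below reproduces the idea on made-up three-dimensional vectors. `unit` and `analogy` are hypothetical helpers that only roughly mirror what `most_similar` does: normalise the inputs, add the positives, subtract the negatives, then rank the remaining words by cosine similarity.

```python
import math

def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def analogy(vectors, positive, negative, topn=1):
    """Rank words by cosine similarity to sum(positive) - sum(negative)."""
    dim = len(next(iter(vectors.values())))
    query = [0.0] * dim
    for word in positive:
        query = [q + x for q, x in zip(query, unit(vectors[word]))]
    for word in negative:
        query = [q - x for q, x in zip(query, unit(vectors[word]))]
    query = unit(query)
    # The query words themselves are excluded from the results.
    exclude = set(positive) | set(negative)
    scored = [
        (word, sum(q * x for q, x in zip(query, unit(vec))))
        for word, vec in vectors.items()
        if word not in exclude
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]

# Hypothetical vectors; axes loosely stand for [royalty, maleness, food].
toy_vectors = {
    "king":     [1.0,  1.0, 0.0],
    "queen":    [1.0, -1.0, 0.0],
    "prince":   [0.9,  1.0, 0.0],
    "man":      [0.1,  1.0, 0.0],
    "woman":    [0.1, -1.0, 0.0],
    "broccoli": [0.0,  0.0, 1.0],
}

best = analogy(toy_vectors, positive=["king", "woman"], negative=["man"])
print(best[0][0])  # queen
```

Excluding the input words matters: without that step the nearest neighbour of the query is often one of the query words itself.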

GloVe Embeddings

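GloVe (Global Vectors for Word Representation, Stanford, 2014) trains on word co-occurrence statistics aggregated over the entire corpus. Instead of updating on one local context window at a time as Word2Vec does, it first counts how often each pair of words appears together, then fits vectors whose dot products approximate the logarithm of those counts.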
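
The counting stage can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the reference implementation: `cooccurrence_counts` is a hypothetical helper, the window is symmetric, and each pair that is d words apart contributes 1/d to its count, which follows the weighting described in the GloVe paper.

```python
from collections import defaultdict

def cooccurrence_counts(sentences, window=2):
    """Accumulate distance-weighted co-occurrence counts over all sentences."""
    counts = defaultdict(float)
    for tokens in sentences:
        for i, word in enumerate(tokens):
            # Look back at up to `window` previous tokens and update
            # both directions so the resulting counts are symmetric.
            for j in range(max(0, i - window), i):
                weight = 1.0 / (i - j)
                counts[(word, tokens[j])] += weight
                counts[(tokens[j], word)] += weight
    return dict(counts)

# Toy corpus, invented for illustration.
sentences = [
    ["language", "models", "learn", "language"],
    ["models", "learn", "patterns"],
]
X = cooccurrence_counts(sentences, window=2)

print(X[("models", "learn")])     # 2.0 (adjacent in both sentences)
print(X[("models", "patterns")])  # 0.5 (two words apart, once)
```

Training then fits word vectors against the statistics in `X`, which is why the counts only need to be collected once for the whole corpus.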

import gensim.downloader as api
import numpy as np

def load_glove(glove_file):
    """Parse a GloVe text file into a {word: vector} dict."""
    embeddings = {}
    with open(glove_file, 'r', encoding='utf-8') as f:
        for line in f:
            # Each line is: word v1 v2 ... vN (space-separated)
            values = line.split()
            word = values[0]
            vector = np.array(values[1:], dtype='float32')
            embeddings[word] = vector
    return embeddings

# Download from: https://nlp.stanford.edu/projects/glove/
# glove = load_glove("glove.6B.100d.txt")

# Using via gensim
glove_model = api.load("glove-wiki-gigaword-100")
print(glove_model.most_similar("transformer", topn=3))

Contextual Embeddings from Transformers

Static embeddings like Word2Vec give the same vector for “bank” in every context. Contextual embeddings (BERT, RoBERTa, GPT) produce a different vector depending on the surrounding words.

from transformers import AutoTokenizer, AutoModel
import torch

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
model.eval()

def get_contextual_embedding(text, word):
    """Return the last-layer vector for the first occurrence of `word` in `text`."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    # Last hidden state: (batch, seq_len, 768)
    embeddings = outputs.last_hidden_state[0]
    # Locate the token by name rather than by word position: [CLS] and
    # subword splits shift indices. Assumes `word` is a single token.
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
    return embeddings[tokens.index(word)]

# "bank" in different contexts → different vectors
text1 = "She deposited money at the bank."
text2 = "They rested on the river bank."
emb1 = get_contextual_embedding(text1, "bank")
emb2 = get_contextual_embedding(text2, "bank")

similarity = torch.nn.functional.cosine_similarity(emb1.unsqueeze(0), emb2.unsqueeze(0))
print(f"Context similarity for 'bank': {similarity.item():.4f}")
# Expect a score well below 1.0: the two senses get different vectors

Comparing Embedding Approaches

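| Model | Context-aware | Dimension | Vocabulary | Best for |
|---|---|---|---|---|
| Word2Vec | No | 100–300 | Fixed | Fast similarity, analogies |
| GloVe | No | 50–300 | Fixed | Same as Word2Vec |
| FastText | No | 100–300 | Subword (character n-grams) | Rare words, morphology |
| BERT | Yes | 768 (base) | WordPiece | Disambiguation, fine-tuning |
| sentence-transformers | Yes | 384–1024 | WordPiece or BPE (depends on base model) | Semantic search, RAG |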
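
FastText's handling of rare words comes from representing each word as a bag of character n-grams, so an unseen word can still be assembled from pieces seen during training. The sketch below shows only the n-gram extraction step. `char_ngrams` is a hypothetical helper, and the `<` and `>` boundary markers and the default 3 to 6 length range follow the FastText paper.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Return the character n-grams of a word wrapped in boundary markers."""
    wrapped = f"<{word}>"
    grams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(wrapped) - n + 1):
            grams.append(wrapped[i:i + n])
    return grams

trigrams = char_ngrams("where", min_n=3, max_n=3)
print(trigrams)  # ['<wh', 'whe', 'her', 'ere', 're>']
```

A word's vector is then built from the vectors of its n-grams, which is how related forms such as "run" and "running" end up sharing parameters.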

Practical Applications

Semantic search — find documents similar in meaning, not just matching keywords.

Sentiment analysis — encode reviews as averaged word vectors, then classify.

Document clustering — group articles by topic using embedding similarity.

RAG (Retrieval-Augmented Generation) — store chunk embeddings in a vector database, retrieve by cosine similarity.

Transfer learning — initialize neural network weights with pre-trained embeddings, then fine-tune.
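
Several of these applications share one recipe: turn each text into a single vector, then compare vectors by cosine similarity. The sketch below is a toy version of that recipe using averaged word vectors. The word vectors are invented for illustration, and `embed`, `cosine`, and `search` are hypothetical helpers, not library functions.

```python
import math

# Hypothetical word vectors; axes loosely stand for [finance, nature, food].
word_vectors = {
    "money":  [0.9, 0.1, 0.0],
    "loan":   [0.8, 0.0, 0.1],
    "cash":   [0.9, 0.0, 0.1],
    "river":  [0.1, 0.9, 0.0],
    "forest": [0.0, 0.8, 0.1],
}

def embed(text):
    """Average the vectors of the known words in a text."""
    known = [word_vectors[w] for w in text.lower().split() if w in word_vectors]
    if not known:
        raise ValueError(f"no known words in: {text!r}")
    return [sum(component) / len(known) for component in zip(*known)]

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def search(query, documents, topn=1):
    """Return the documents most similar in meaning to the query."""
    query_vec = embed(query)
    scored = [(doc, cosine(query_vec, embed(doc))) for doc in documents]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]

documents = ["loan money", "river forest"]
hits = search("cash", documents)
print(hits[0][0])  # loan money
```

The query shares no keyword with the winning document; the match comes entirely from the vectors. A production system would swap the toy dictionary for a trained embedding model and the linear scan for a vector database.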

In 2025, sentence-level embeddings from sentence-transformers have largely replaced raw word embeddings for most downstream tasks. But understanding word embeddings is foundational — they introduced the idea that meaning can be captured geometrically.