Stemming and Lemmatization in NLP

Both stemming and lemmatization reduce words to a common base form — but they do it differently, and that difference matters for your application.


The Core Problem They Solve

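Consider these word forms: running, runs, ran. For a search engine or text classifier, these should be treated as the same concept. Without normalization, they appear as three unrelated words. Stemming and lemmatization both address this by mapping word forms to a shared base. Note that a suffix-stripping stemmer only catches the regular forms (running, runs); the irregular form ran needs the dictionary lookup that a lemmatizer provides.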
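To make the problem concrete, here is a minimal sketch that counts tokens before and after normalization. The BASE_FORM lookup is a hypothetical hand-written table standing in for a real stemmer or lemmatizer; it is only there to show the effect on the counts.

```python
from collections import Counter

tokens = ["running", "runs", "ran", "run", "runs"]

# Raw counts: every surface form is its own key.
raw_counts = Counter(tokens)

# Hypothetical hand-written lookup, standing in for a real stemmer or lemmatizer.
BASE_FORM = {"running": "run", "runs": "run", "ran": "run"}

# Normalized counts: all forms collapse onto one key.
normalized_counts = Counter(BASE_FORM.get(t, t) for t in tokens)

print(len(raw_counts))     # 4 distinct keys
print(normalized_counts)   # Counter({'run': 5})
```

A classifier or search index built on the raw counts sees four unrelated features; built on the normalized counts it sees one.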


Stemming: Fast and Crude

Stemming chops off word suffixes using rule-based heuristics. It’s fast and requires no dictionary, but the resulting “stem” is often not a real word.

from nltk.stem import PorterStemmer, SnowballStemmer

ps = PorterStemmer()
words = ["running", "studies", "happily", "generously", "connection"]
for word in words:
    # Left-align the word in a 15-character column, then show its stem
    print(f"{word:15} → {ps.stem(word)}")
# running → run
# studies → studi ← not a real word
# happily → happili ← not a real word
# generously → gener ← over-stemmed, not a real word
# connection → connect

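The Snowball stemmer for English (Martin Porter's own revision of his algorithm, often called Porter2) is also available in NLTK. It fixes some of the original algorithm's over-stemming and supports several languages besides English: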

snow = SnowballStemmer("english")
print(snow.stem("generously")) # → generous
print(snow.stem("arguing")) # → argu
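To show what "rule-based heuristics" means in practice, here is a toy suffix stripper. It is deliberately not the Porter or Snowball algorithm: the rule list and the MIN_STEM_LEN guard are assumptions made up for this sketch, and real stemmers apply many more rules in several ordered steps.

```python
# Toy suffix stripper: an illustration of rule-based stemming, NOT the Porter algorithm.
# Rules are tried in order; the first matching suffix is replaced.
RULES = [
    ("ies", "i"),   # studies -> studi
    ("ing", ""),    # connecting -> connect
    ("ly", ""),     # quickly -> quick
    ("s", ""),      # cats -> cat
]

MIN_STEM_LEN = 3  # assumed guard so short words like "is" are left alone


def toy_stem(word: str) -> str:
    """Strip the first matching suffix, if enough of the word remains."""
    word = word.lower()
    for suffix, replacement in RULES:
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM_LEN:
            return word[: len(word) - len(suffix)] + replacement
    return word  # no rule applied


for w in ["studies", "connecting", "quickly", "cats", "ran"]:
    print(f"{w:12} -> {toy_stem(w)}")
```

No dictionary is consulted at any point, which is why this approach is fast and also why "ran" comes back unchanged: there is no suffix to strip.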

Lemmatization: Slower but Accurate

Lemmatization uses a vocabulary and morphological analysis to return the actual dictionary form (lemma) of a word. It needs to know the word’s part of speech to do this correctly.

import nltk
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer

# One-time download of the WordNet data
# (some NLTK versions may also need nltk.download('omw-1.4'))
nltk.download('wordnet')

lem = WordNetLemmatizer()

# Without a POS tag, lemmatize() defaults to noun
print(lem.lemmatize("running")) # → running (wrong, treated as noun)

# With the correct POS tag
print(lem.lemmatize("running", pos=wordnet.VERB)) # → run
print(lem.lemmatize("studies", pos=wordnet.VERB)) # → study
print(lem.lemmatize("happily", pos=wordnet.ADV)) # → happily (adverbs don't change much)
print(lem.lemmatize("better", pos=wordnet.ADJ)) # → good ← semantically meaningful
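In a real pipeline the POS tags usually come from a tagger rather than being typed by hand. NLTK's default tagger emits Penn Treebank tags (NN, VBG, JJR, ...), while WordNetLemmatizer expects WordNet's single-letter codes. A common bridge, sketched here with a hypothetical helper named treebank_to_wordnet, looks roughly like this:

```python
# WordNet's POS constants are plain single characters:
# wordnet.NOUN == "n", wordnet.VERB == "v", wordnet.ADJ == "a", wordnet.ADV == "r".
def treebank_to_wordnet(tag: str) -> str:
    """Map a Penn Treebank tag (e.g. 'VBG') to a WordNet POS letter."""
    if tag.startswith("JJ"):   # JJ, JJR, JJS: adjectives
        return "a"
    if tag.startswith("VB"):   # VB, VBD, VBG, VBN, VBP, VBZ: verbs
        return "v"
    if tag.startswith("RB"):   # RB, RBR, RBS: adverbs
        return "r"
    return "n"                 # everything else: fall back to noun


for tag in ["VBG", "JJR", "RB", "NNS"]:
    print(tag, "->", treebank_to_wordnet(tag))
```

The result can be passed straight to the lemmatizer, for example lem.lemmatize(word, pos=treebank_to_wordnet(tag)) for each (word, tag) pair returned by nltk.pos_tag, assuming the tagger's data has been downloaded. Falling back to noun mirrors the lemmatizer's own default.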

spaCy runs a POS tagger as part of its pipeline, so its lemmatizer picks up each word's part of speech from context automatically:

import spacy

# Requires the model: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")
doc = nlp("The geese were running faster than the children expected.")
for token in doc:
    print(f"{token.text:12} → lemma: {token.lemma_:12} pos: {token.pos_}")
# First five tokens (exact lemmas and tags can vary between model versions):
# The → lemma: the pos: DET
# geese → lemma: goose pos: NOUN
# were → lemma: be pos: AUX
# running → lemma: run pos: VERB
# faster → lemma: fast pos: ADV

Side-by-Side Comparison

Word       | Stemmer (Porter) | Lemmatizer (spaCy)
──────────────────────────────────────────────────────
studies    | studi            | study
better     | better           | good
caring     | care             | care
geese      | gees             | goose
troubling  | troubl           | trouble
generously | gener            | generously

Stemming is fast but aggressive, and its output is often not a real word. Lemmatization is slower but returns genuine dictionary forms.


When to Use Which

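Use stemming when:

Speed and simplicity matter more than readable output (for example, a search index where "studi" only ever has to match itself), or no dictionary or POS tagger is available for your text.

Use lemmatization when:

The output must be a real word, irregular forms matter (better → good, geese → goose), or your pipeline already produces POS tags.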


Neither Is Always Needed

In the LLM era (2025), transformer-based models handle morphological variation through their subword tokenization and training data. If you’re building a pipeline with BERT, GPT, or a sentence transformer, you typically skip both stemming and lemmatization entirely — the model handles it internally.
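To illustrate how subword tokenization absorbs morphological variation, here is a toy greedy longest-match splitter in the style of WordPiece, the scheme used by BERT-family tokenizers. The VOCAB set is a hypothetical miniature vocabulary invented for this sketch; real models learn tens of thousands of pieces from data, and GPT-style models use a different scheme (byte-pair encoding).

```python
# Hypothetical miniature vocabulary; "##" marks a piece that continues a word.
VOCAB = {"run", "walk", "##ning", "##s", "##ing", "##ed", "[UNK]"}


def wordpiece(word, vocab=VOCAB):
    """Greedy longest-match-first split of a single word into subword pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink from the right.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return ["[UNK]"]  # no piece fits: the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


for w in ["running", "runs", "walked", "ran"]:
    print(f"{w:8} -> {wordpiece(w)}")
```

Here "running" and "runs" both start with the shared piece "run", so the model sees their relationship without any stemming step. An irregular form like "ran" gets no such help from the splitter in this toy vocabulary; a real model would have its own piece for it and learn the connection from training data.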

These techniques remain relevant for: