Stemming and Lemmatization in NLP

Both stemming and lemmatization reduce words to a common base form — but they do it differently, and that difference matters for your application.

The Core Problem They Solve

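Consider these word forms: running, runs, ran. For a search engine or text classifier, these should usually be treated as the same concept. Without normalization, they appear as three unrelated tokens. Stemming and lemmatization both address this by mapping related forms to a shared base, although a suffix-stripping stemmer generally will not connect an irregular form like ran to run; that takes a dictionary-based lemmatizer.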
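As a rough illustration of the problem, the sketch below counts tokens with and without normalization. The BASE_FORMS lookup table is a hand-written stand-in invented for this example, not the output of a real stemmer or lemmatizer.

```python
from collections import Counter

tokens = "she runs daily , he ran yesterday , they are running now".split()

# Hypothetical hand-written lookup table, standing in for a real normalizer.
BASE_FORMS = {"runs": "run", "ran": "run", "running": "run"}

# Raw counts: every surface form is its own vocabulary entry.
raw_counts = Counter(tokens)

# Normalized counts: the three forms collapse into a single entry.
normalized_counts = Counter(BASE_FORMS.get(tok, tok) for tok in tokens)

print(raw_counts["runs"], raw_counts["ran"], raw_counts["running"])  # 1 1 1
print(normalized_counts["run"])                                      # 3
```

The vocabulary shrinks by two entries here, and the evidence for the concept "run" is pooled into one count instead of being spread across three.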

Stemming: Fast and Crude

Stemming chops off word suffixes using rule-based heuristics. It’s fast and requires no dictionary, but the resulting “stem” is often not a real word.

from nltk.stem import PorterStemmer, SnowballStemmer

ps = PorterStemmer()
words = ["running", "studies", "happily", "generously", "connection"]

for word in words:
    print(f"{word:15} → {ps.stem(word)}")

# running         → run
# studies         → studi     ← not a real word
# happily         → happili   ← not a real word
# generously      → gener     ← over-stemmed; "generous" gets the same stem
# connection      → connect

The Snowball stemmer (also in NLTK) implements Porter's revised algorithm, sometimes called Porter2. It is generally less prone to over-stemming and supports multiple languages:

snow = SnowballStemmer("english")
print(snow.stem("generously"))  # → generous
print(snow.stem("arguing"))     # → argu
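To make "rule-based heuristics" concrete, here is a toy suffix stripper. The rule list and the toy_stem name are invented for illustration; the real Porter and Snowball rule sets are much larger and check more conditions on the remaining stem.

```python
# Toy suffix-stripping stemmer. The rules are a made-up illustration,
# far smaller and cruder than the Porter or Snowball rule sets.
SUFFIX_RULES = [
    ("ies", "i"),  # studies -> studi
    ("ing", ""),   # jumping -> jump
    ("ly", ""),    # quickly -> quick
    ("ed", ""),    # jumped  -> jump
    ("s", ""),     # cats    -> cat
]

def toy_stem(word: str, min_stem_len: int = 3) -> str:
    """Apply the first matching rule that leaves a long enough stem."""
    word = word.lower()
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix):
            stem = word[: len(word) - len(suffix)]
            # Guard against stripping short words down to nothing ("sing" -> "s").
            if len(stem) >= min_stem_len:
                return stem + replacement
    return word

for w in ["studies", "jumping", "jumped", "quickly", "cats", "running", "sing"]:
    print(f"{w:10} -> {toy_stem(w)}")
```

Note that running comes out as runn: the toy rules have no step for undoubling consonants, which is one of the many cases a real stemmer handles.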

Lemmatization: Slower but Accurate

Lemmatization uses a vocabulary and morphological analysis to return the actual dictionary form (lemma) of a word. It needs to know the word’s part of speech to do this correctly.

import nltk
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer

nltk.download('wordnet')  # one-time download of the WordNet data

lem = WordNetLemmatizer()

# Without POS tag — defaults to noun
print(lem.lemmatize("running"))   # → running  (wrong, treated as noun)

# With correct POS tag
print(lem.lemmatize("running", pos=wordnet.VERB))  # → run
print(lem.lemmatize("studies",  pos=wordnet.VERB))  # → study
print(lem.lemmatize("happily",  pos=wordnet.ADV))   # → happily (adverbs don't change much)
print(lem.lemmatize("better",   pos=wordnet.ADJ))   # → good  ← semantically meaningful
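In practice the POS tag usually comes from a tagger rather than being typed by hand. NLTK's default tagger emits Penn Treebank tags (NN, VBG, JJR and so on), which have to be mapped to WordNet's four categories before lemmatizing. A minimal sketch of that mapping, assuming a helper named penn_to_wordnet (not part of NLTK) and hard-coded example tags in place of real tagger output:

```python
def penn_to_wordnet(tag: str) -> str:
    """Map a Penn Treebank tag to a WordNet POS letter.

    The returned letters are the values of wordnet.ADJ, wordnet.VERB,
    wordnet.ADV and wordnet.NOUN, so they can be passed as `pos=`.
    """
    if tag.startswith("JJ"):
        return "a"  # adjective
    if tag.startswith("VB"):
        return "v"  # verb
    if tag.startswith("RB"):
        return "r"  # adverb
    return "n"      # noun, which is also the lemmatizer's default

# Hard-coded (word, tag) pairs standing in for the output of nltk.pos_tag(tokens).
tagged = [("geese", "NNS"), ("running", "VBG"), ("better", "JJR"), ("happily", "RB")]
pairs = [(word, penn_to_wordnet(tag)) for word, tag in tagged]
print(pairs)
# [('geese', 'n'), ('running', 'v'), ('better', 'a'), ('happily', 'r')]

# With NLTK and its data installed, each pair would then be used as:
#   lem.lemmatize(word, pos=penn_to_wordnet(tag))
```

Falling back to noun for unknown tags mirrors what the lemmatizer does when no POS is given, so the helper never makes results worse than the untagged call.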

spaCy assigns POS tags as part of its pipeline, so its lemmatizer uses them without any manual tagging:

import spacy
nlp = spacy.load("en_core_web_sm")

doc = nlp("The geese were running faster than the children expected.")
for token in doc:
    print(f"{token.text:12} → lemma: {token.lemma_:12} pos: {token.pos_}")

# The          → lemma: the          pos: DET
# geese        → lemma: goose        pos: NOUN
# were         → lemma: be           pos: AUX
# running      → lemma: run          pos: VERB
# faster       → lemma: fast         pos: ADV
# (remaining tokens omitted; exact output can vary with the model version)

Side-by-Side Comparison

Word          | Stemmer (Porter) | Lemmatizer (spaCy)
──────────────────────────────────────────────────────
studies       | studi            | study
better        | better           | good
caring        | care             | care
geese         | gees             | goose
troubling     | troubl           | trouble
generously    | gener            | generously

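Stemming is fast but aggressive: it can produce non-words and merge unrelated terms. Lemmatization returns real dictionary forms, provided the part of speech is identified correctly.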

When to Use Which

Use stemming when:

Speed is critical (large-scale search indexing)
You need a quick baseline and linguistic precision isn’t required
You are building a keyword-matching system
You are working with a language that has no lemmatizer

Use lemmatization when:

Accuracy matters more than speed
You are building classifiers or models that need real words
You are doing linguistic analysis or annotation
You are feeding a downstream model that benefits from correct base forms
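Whichever you choose, the same normalizer has to run on both the documents and the queries, or the terms will not line up. A small sketch of that idea with a tiny inverted index follows; build_index, search, and the deliberately crude strip_plural_s normalizer are all hypothetical names made up for this example.

```python
from collections import defaultdict
from typing import Callable, Dict, Set

def build_index(docs: Dict[int, str], normalize: Callable[[str], str]) -> Dict[str, Set[int]]:
    """Map each normalized term to the ids of the documents containing it."""
    index: Dict[str, Set[int]] = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[normalize(token)].add(doc_id)
    return index

def search(index: Dict[str, Set[int]], query: str, normalize: Callable[[str], str]) -> Set[int]:
    """Return ids of documents containing every normalized query term."""
    terms = [normalize(t) for t in query.lower().split()]
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result

def strip_plural_s(token: str) -> str:
    """Crude stand-in for a stemmer: drop a trailing 's' from longer words."""
    return token[:-1] if token.endswith("s") and len(token) > 3 else token

docs = {1: "connection pools", 2: "pool sizing", 3: "thread safety"}

exact_index = build_index(docs, lambda t: t)
stemmed_index = build_index(docs, strip_plural_s)

print(search(exact_index, "pools", lambda t: t))        # {1}
print(search(stemmed_index, "pools", strip_plural_s))   # {1, 2}
```

Swapping strip_plural_s for a real stemmer or lemmatizer would only change the normalize argument; the index and query code stay the same.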

Neither Is Always Needed

In the LLM era (2025), transformer-based models handle morphological variation through subword tokenization and what they learn from training data. If you’re building a pipeline with BERT, GPT, or a sentence transformer, you typically skip both stemming and lemmatization entirely, because the model handles word forms internally.
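To give a feel for subword tokenization, here is a toy greedy longest-match splitter in the style of WordPiece, the scheme used by BERT. The vocabulary below is hypothetical and tiny; real vocabularies hold tens of thousands of learned pieces, and GPT-style models use a different scheme (byte-pair encoding).

```python
from typing import List, Set

def wordpiece(word: str, vocab: Set[str]) -> List[str]:
    """Greedy longest-match-first split; non-initial pieces carry a '##' prefix."""
    pieces: List[str] = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink from the right.
        while end > start:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return ["[UNK]"]  # no piece fits, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces

# Hypothetical vocabulary, chosen only to make the example work.
VOCAB = {"run", "##ning", "##s", "ran", "study", "##ing"}

for w in ["running", "runs", "ran", "studying"]:
    print(f"{w:10} -> {wordpiece(w, VOCAB)}")
# running    -> ['run', '##ning']
# runs       -> ['run', '##s']
# ran        -> ['ran']
# studying   -> ['study', '##ing']
```

Regular forms end up sharing the piece run, while the irregular ran does not; the model has to learn that relationship from its training data rather than from the tokenizer.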

These techniques remain relevant for:

Classic ML pipelines (TF-IDF + logistic regression)
Search engine preprocessing
Rule-based NLP tools
Resource-constrained environments