Stemming and Lemmatization in NLP
Both stemming and lemmatization reduce words to a common base form — but they do it differently, and that difference matters for your application.
The Core Problem They Solve
Consider these word forms: running, runs, ran. For a search engine or text classifier, these should be treated as the same concept. Without normalization, they appear as three unrelated words. Stemming and lemmatization both address this by mapping variants to a shared base form, though only lemmatization reliably handles irregular forms such as ran.
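To see the problem in numbers, here is a minimal sketch that counts tokens before and after normalization. The `BASE_FORMS` lookup is a hand-written, hypothetical stand-in for a real stemmer or lemmatizer:

```python
from collections import Counter

tokens = "she runs daily he ran yesterday they are running now".split()

# Raw counts: every surface form is its own vocabulary entry.
raw_counts = Counter(tokens)

# Hypothetical hand-written lookup standing in for a real stemmer or lemmatizer.
BASE_FORMS = {"runs": "run", "ran": "run", "running": "run"}

# Normalized counts: the three variants collapse into one entry.
normalized_counts = Counter(BASE_FORMS.get(tok, tok) for tok in tokens)

print(raw_counts["runs"], raw_counts["ran"], raw_counts["running"])  # 1 1 1
print(normalized_counts["run"])                                      # 3
print(len(raw_counts), len(normalized_counts))                       # 10 8
```

The vocabulary shrinks and the evidence for "run" is pooled into a single feature, which is the effect both techniques aim for.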
Stemming: Fast and Crude
Stemming chops off word suffixes using rule-based heuristics. It’s fast and requires no dictionary, but the resulting “stem” is often not a real word.
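To make the suffix-chopping idea concrete, here is a minimal sketch of a toy stemmer. The rule list and the `toy_stem` name are invented for illustration and are far cruder than Porter's actual algorithm:

```python
# Toy suffix-stripping stemmer. The rules are hypothetical and only
# illustrate the mechanism; this is not the Porter algorithm.
SUFFIX_RULES = [
    ("ies", "i"),  # studies -> studi
    ("ing", ""),   # jumping -> jump
    ("ly", ""),    # quickly -> quick
    ("s", ""),     # cats -> cat
]


def toy_stem(word: str) -> str:
    """Apply the first matching rule, leaving at least three characters before the suffix."""
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: len(word) - len(suffix)] + replacement
    return word


for word in ["studies", "jumping", "quickly", "cats", "running", "bus"]:
    print(f"{word:10} -> {toy_stem(word)}")

# studies    -> studi
# jumping    -> jump
# quickly    -> quick
# cats       -> cat
# running    -> runn
# bus        -> bus
```

Note that `running` comes out as `runn`: production stemmers carry extra rules (for example, undoubling consonants) to handle cases this sketch ignores.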
from nltk.stem import PorterStemmer, SnowballStemmer
ps = PorterStemmer()
words = ["running", "studies", "happily", "generously", "connection"]
for word in words:
    print(f"{word:15} → {ps.stem(word)}")
# running    → run
# studies    → studi    ← not a real word
# happily    → happili  ← not a real word
# generously → gener    ← not a real word
# connection → connect
The Snowball stemmer (also in NLTK) is a refined successor to Porter that over-stems less often and supports multiple languages:
snow = SnowballStemmer("english")
print(snow.stem("generously"))  # → generous
print(snow.stem("arguing"))     # → argu
Lemmatization: Slower but Accurate
Lemmatization uses a vocabulary and morphological analysis to return the actual dictionary form (lemma) of a word. It needs to know the word’s part of speech to do this correctly.
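Why the part of speech matters can be sketched with a toy lookup table. The `LEXICON` entries and the `toy_lemmatize` helper are hypothetical stand-ins and do not reflect how WordNet is implemented:

```python
# Hypothetical miniature lexicon: (surface form, part of speech) -> lemma.
# Real lemmatizers rely on full resources such as WordNet.
LEXICON = {
    ("running", "VERB"): "run",
    ("running", "NOUN"): "running",
    ("better", "ADJ"): "good",
    ("better", "ADV"): "well",
    ("geese", "NOUN"): "goose",
}


def toy_lemmatize(word: str, pos: str) -> str:
    """Look up the lemma for a word/POS pair; fall back to the word itself."""
    return LEXICON.get((word.lower(), pos), word)


print(toy_lemmatize("running", "VERB"))  # run
print(toy_lemmatize("running", "NOUN"))  # running
print(toy_lemmatize("better", "ADJ"))    # good
print(toy_lemmatize("better", "ADV"))    # well
print(toy_lemmatize("geese", "NOUN"))    # goose
```

The same surface form maps to different lemmas depending on its tag, which is why a lemmatizer given the wrong part of speech can return the wrong answer.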
import nltk
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')  # one-time download of the WordNet data
lem = WordNetLemmatizer()
# Without a POS tag, the lemmatizer defaults to noun
print(lem.lemmatize("running"))  # → running (wrong, treated as noun)
# With the correct POS tag
print(lem.lemmatize("running", pos=wordnet.VERB))  # → run
print(lem.lemmatize("studies", pos=wordnet.VERB))  # → study
print(lem.lemmatize("happily", pos=wordnet.ADV))   # → happily (adverbs don't change much)
print(lem.lemmatize("better", pos=wordnet.ADJ))    # → good ← semantically meaningful
spaCy lemmatization automatically handles POS:
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp("The geese were running faster than the children expected.")
for token in doc:
    print(f"{token.text:12} → lemma: {token.lemma_:12} pos: {token.pos_}")
# The     → lemma: the    pos: DET
# geese   → lemma: goose  pos: NOUN
# were    → lemma: be     pos: AUX
# running → lemma: run    pos: VERB
# faster  → lemma: fast   pos: ADV
# (remaining tokens omitted)
Side-by-Side Comparison
Word       | Stemmer (Porter) | Lemmatizer (spaCy)
──────────────────────────────────────────────────
studies    | studi            | study
better     | better           | good
caring     | care             | care
geese      | gees             | goose
troubling  | troubl           | trouble
generously | gener            | generously
Stemming is aggressive and fast. Lemmatization is slower but returns real dictionary forms.
When to Use Which
Use stemming when:
- Speed is critical (large-scale search indexing)
- You need a quick baseline and linguistic precision isn’t required
- Building a keyword-matching system
- Working with a language that has no lemmatizer
Use lemmatization when:
- Accuracy matters more than speed
- Building classifiers or models that need real words
- Doing linguistic analysis or annotation
- Working with a downstream model that benefits from correct base forms
Neither Is Always Needed
In the LLM era (2025), transformer-based models handle morphological variation through subword tokenization and large-scale training data. If you’re building a pipeline with BERT, GPT, or a sentence transformer, you typically skip both stemming and lemmatization entirely and feed the model raw text, because its tokenizer and learned representations already relate inflected forms to one another.
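As a rough illustration of subword tokenization, here is a minimal sketch of a greedy longest-match splitter in the style of WordPiece. The `VOCAB` set and the `subword_tokenize` function are hypothetical; real tokenizers learn vocabularies of tens of thousands of pieces and may split these words differently:

```python
# Hypothetical toy vocabulary; "##" marks a piece that continues a word.
VOCAB = {"run", "walk", "##s", "##ning", "##ing", "##ed"}


def subword_tokenize(word: str) -> list[str]:
    """Greedy longest-match-first split; returns ["[UNK]"] if no split exists."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in VOCAB:
                match = candidate
                break
            end -= 1
        if match is None:
            return ["[UNK]"]
        pieces.append(match)
        start = end
    return pieces


print(subword_tokenize("runs"))     # ['run', '##s']
print(subword_tokenize("running"))  # ['run', '##ning']
print(subword_tokenize("walked"))   # ['walk', '##ed']
print(subword_tokenize("ran"))      # ['[UNK]']
```

Here `runs` and `running` share the `run` piece, so a model sees the relation without any stemming. The irregular `ran` gets no help from this toy vocabulary; in a real model such a form would typically have its own vocabulary entry, with its link to `run` learned from training data.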
Stemming and lemmatization remain relevant for:
- Classic ML pipelines (TF-IDF + logistic regression)
- Search engine preprocessing
- Rule-based NLP tools
- Resource-constrained environments