Lemmatization in NLP

Lemmatization finds the dictionary form (lemma) of a word. Unlike stemming, which chops off endings using rules, lemmatization uses vocabulary knowledge to return a real, meaningful word.
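The contrast can be sketched in a few lines of plain Python. This is a toy illustration only: `SUFFIXES`, `TOY_LEMMAS`, `naive_stem`, and `toy_lemmatize` are hypothetical names invented for this sketch, and real stemmers and lemmatizers use far larger rule sets and vocabularies.

```python
# Toy illustration: suffix stripping versus vocabulary lookup.
# All names below are hypothetical, not part of any library.

SUFFIXES = ("ing", "ies", "es", "ed", "s")  # checked in this order

# Mini-vocabulary mapping inflected forms to their lemmas.
TOY_LEMMAS = {
    "studies": "study",
    "studying": "study",
    "geese": "goose",
    "ran": "run",
}

def naive_stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least 3 characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def toy_lemmatize(word: str) -> str:
    """Look the word up in the vocabulary; fall back to the word itself."""
    return TOY_LEMMAS.get(word, word)

for w in ["studies", "studying", "geese", "ran"]:
    print(f"{w:10} stem={naive_stem(w):8} lemma={toy_lemmatize(w)}")
```

The stemmer turns "studies" into the non-word "stud" and leaves "geese" untouched, while the lookup returns "study" and "goose".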

What Is a Lemma?

A lemma is the canonical form of a word, the form you’d find in a dictionary:

ran    → run    (verb lemma)
geese  → goose  (noun lemma)
better → good   (adjective lemma)
was    → be     (auxiliary verb lemma)

This matters because NLP models should understand that ran, runs, and running all express the same underlying concept.
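A small counting sketch shows the effect. It assumes a hypothetical lookup table (`TOY_LEMMAS`) standing in for a real lemmatizer: surface forms are counted separately until they are mapped to a shared lemma.

```python
from collections import Counter

# Hypothetical lookup table standing in for a real lemmatizer.
TOY_LEMMAS = {"ran": "run", "runs": "run", "running": "run"}

tokens = ["she", "ran", "he", "runs", "they", "like", "running"]

# Counting surface forms keeps the three variants apart.
surface_counts = Counter(tokens)

# Counting lemmas merges them into a single entry.
lemma_counts = Counter(TOY_LEMMAS.get(t, t) for t in tokens)

print(surface_counts["run"], lemma_counts["run"])  # 0 3
```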

Why POS Tagging Matters for Lemmatization

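The same word can have different lemmas depending on its part of speech. NLTK's WordNetLemmatizer does not infer the part of speech: without a pos argument it treats every word as a noun, so verbs and adjectives often come back unchanged.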

from nltk.stem import WordNetLemmatizer
import nltk
nltk.download('wordnet')

lem = WordNetLemmatizer()

# "meeting" as a noun vs verb
print(lem.lemmatize("meeting", pos='n'))  # → meeting  (noun: a scheduled gathering)
print(lem.lemmatize("meeting", pos='v'))  # → meet     (verb: the action)

# "caring" as adjective vs verb
print(lem.lemmatize("caring", pos='a'))   # → caring   (adjective)
print(lem.lemmatize("caring", pos='v'))   # → care     (verb)
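The default matters as much as the explicit tags: when pos is omitted, `WordNetLemmatizer.lemmatize` assumes a noun. The toy lookup below is only a sketch of that behavior, not NLTK's implementation; `TOY_TABLE` and `toy_lemmatize` are hypothetical names.

```python
# Hypothetical (word, pos) -> lemma table. A real lemmatizer consults
# WordNet or a trained model instead of a hand-written dictionary.
TOY_TABLE = {
    ("meeting", "n"): "meeting",
    ("meeting", "v"): "meet",
    ("running", "v"): "run",
    ("better", "a"): "good",
}

def toy_lemmatize(word: str, pos: str = "n") -> str:
    """Return the lemma for (word, pos); default to noun like NLTK does."""
    return TOY_TABLE.get((word, pos), word)

print(toy_lemmatize("running"))           # running  (treated as a noun)
print(toy_lemmatize("running", pos="v"))  # run
print(toy_lemmatize("better", pos="a"))   # good
```

The first call returns the input unchanged because no noun entry matches, which is the typical symptom of a missing POS tag.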

NLTK Lemmatizer with Automatic POS Tagging

To get accurate results without manually labeling POS tags, combine NLTK’s POS tagger with the lemmatizer:

import nltk
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet

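# Resource names for recent NLTK releases; older releases use
# 'punkt' and 'averaged_perceptron_tagger' instead.
nltk.download('punkt_tab')                       # needed by word_tokenize
nltk.download('averaged_perceptron_tagger_eng')  # needed by pos_tag
nltk.download('wordnet')                         # needed by WordNetLemmatizer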

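def get_wordnet_pos(treebank_tag):
    """Map a Penn Treebank tag (e.g. 'VBD', 'NNS') to a WordNet POS constant."""
    if treebank_tag.startswith('J'):    # JJ, JJR, JJS
        return wordnet.ADJ
    elif treebank_tag.startswith('V'):  # VB, VBD, VBG, VBN, VBP, VBZ
        return wordnet.VERB
    elif treebank_tag.startswith('N'):  # NN, NNS, NNP, NNPS
        return wordnet.NOUN
    elif treebank_tag.startswith('R'):  # RB, RBR, RBS
        return wordnet.ADV
    return wordnet.NOUN  # default, matching the lemmatizer's own default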

lem = WordNetLemmatizer()
sentence = "The scientists were studying rapidly changing climate patterns."
tokens = nltk.word_tokenize(sentence)
tagged = nltk.pos_tag(tokens)

lemmas = [lem.lemmatize(word, get_wordnet_pos(tag)) for word, tag in tagged]
print(lemmas)
# Expected output (exact lemmas depend on the tags the tagger assigns):
# ['The', 'scientist', 'be', 'study', 'rapidly', 'change', 'climate', 'pattern', '.']
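The tag mapping can also be written as a table lookup with no NLTK import, which makes it easy to unit-test. This is a sketch under one assumption worth stating: the single-letter values are the strings behind NLTK's `wordnet.ADJ`, `wordnet.VERB`, `wordnet.NOUN`, and `wordnet.ADV` constants. `TAG_PREFIX_TO_POS` and `penn_to_wordnet` are hypothetical names.

```python
# First letter of a Penn Treebank tag -> WordNet POS letter.
# 'a', 'v', 'n', 'r' are the values of wordnet.ADJ, VERB, NOUN, ADV.
TAG_PREFIX_TO_POS = {"J": "a", "V": "v", "N": "n", "R": "r"}

def penn_to_wordnet(tag: str, default: str = "n") -> str:
    """Map a Penn Treebank tag to a WordNet POS letter, defaulting to noun."""
    return TAG_PREFIX_TO_POS.get(tag[:1], default)

for tag in ["NNS", "VBD", "VBG", "RB", "JJ", "DT"]:
    print(tag, penn_to_wordnet(tag))
```

Tags outside the four open classes, such as DT for determiners, fall back to the noun default, which leaves most function words unchanged.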

spaCy Lemmatization (Recommended for Production)

spaCy handles POS tagging automatically, making lemmatization much simpler:

import spacy

# Requires the model to be installed first:
#   python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

text = "The children were happily playing in the parks near their schools."
doc = nlp(text)

for token in doc:
    if not token.is_punct and not token.is_space:
        print(f"{token.text:12} → {token.lemma_}")

# Abridged output (lines for "in", "the", "near", and "their" are omitted):
# The          → the
# children     → child
# were         → be
# happily      → happily
# playing      → play
# parks        → park
# schools      → school

Lemmatization for Different Languages

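spaCy provides trained pipelines for more than 20 languages, each with its own lemmatizer. Stanza is another library with lemmatizers for many languages. The examples below use spaCy for German and Stanza for French: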

import spacy

# German lemmatization (requires: python -m spacy download de_core_news_sm)
nlp_de = spacy.load("de_core_news_sm")
doc_de = nlp_de("Die Kinder spielten im Garten.")
for token in doc_de:
    print(f"{token.text} → {token.lemma_}")

# French with Stanza
import stanza
stanza.download('fr')
nlp_fr = stanza.Pipeline('fr')
doc_fr = nlp_fr("Les enfants jouaient dans le jardin.")
for sentence in doc_fr.sentences:
    for word in sentence.words:
        print(f"{word.text} → {word.lemma}")

Practical Impact on Text Classification

Lemmatization can improve models that use bag-of-words or TF-IDF features because it merges word variants that carry the same meaning into one feature, which shrinks the vocabulary and makes related documents look more alike:

from sklearn.feature_extraction.text import TfidfVectorizer
import spacy

nlp = spacy.load("en_core_web_sm")

def lemmatize_text(text):
    doc = nlp(text.lower())
    return " ".join([token.lemma_ for token in doc
                     if not token.is_stop and not token.is_punct])

raw_texts = [
    "The engineers were designing innovative solutions.",
    "An engineer designs innovative solutions efficiently.",
    "Engineering innovation has reshaped modern industries."
]

lemmatized = [lemmatize_text(t) for t in raw_texts]
for orig, lem in zip(raw_texts, lemmatized):
    print(f"Original: {orig}")
    print(f"Lemmatized: {lem}\n")

vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(lemmatized)
print("Vocabulary:", vectorizer.get_feature_names_out())
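To see why merging variants helps, the following dependency-free sketch compares bag-of-words cosine similarity for two of the sentences above, with and without lemma merging. It assumes a hypothetical lookup table (`TOY_LEMMAS`) in place of spaCy and uses raw counts rather than TF-IDF weights, so the numbers only illustrate the direction of the effect.

```python
import math
from collections import Counter

# Hypothetical lemma table standing in for a real lemmatizer.
TOY_LEMMAS = {
    "engineers": "engineer",
    "designing": "design",
    "designs": "design",
    "solutions": "solution",
}

def bag_of_words(text: str, lemmatize: bool = False) -> Counter:
    """Lowercase, drop periods, split on whitespace, optionally map to lemmas."""
    tokens = text.lower().replace(".", "").split()
    if lemmatize:
        tokens = [TOY_LEMMAS.get(t, t) for t in tokens]
    return Counter(tokens)

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

doc1 = "The engineers were designing innovative solutions."
doc2 = "An engineer designs innovative solutions efficiently."

raw_sim = cosine(bag_of_words(doc1), bag_of_words(doc2))
lemma_sim = cosine(bag_of_words(doc1, True), bag_of_words(doc2, True))
print(f"raw={raw_sim:.2f} lemmatized={lemma_sim:.2f}")  # raw=0.33 lemmatized=0.67
```

Without merging, the two sentences share only "innovative" and "solutions". After merging, "engineer" and "design" also match, doubling the overlap.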

When to Skip Lemmatization

With transformer-based models (BERT, RoBERTa, sentence-transformers), lemmatization is usually unnecessary. These models learn morphological relationships during pretraining and are often more accurate when given the original text. Lemmatizing text before passing it to a transformer can hurt performance because it strips inflectional information, such as tense and number, that the model could use.
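The information loss is easy to demonstrate with a toy example. The sketch below assumes a hypothetical lookup table (`TOY_LEMMAS`) and shows two sentences with different meanings collapsing to the same lemma sequence.

```python
# Hypothetical lemma table standing in for a real lemmatizer.
TOY_LEMMAS = {"was": "be", "is": "be", "hired": "hire", "hiring": "hire"}

def toy_lemmatize_sentence(sentence: str) -> list[str]:
    """Lowercase, split on whitespace, and map each token to its lemma."""
    return [TOY_LEMMAS.get(t, t) for t in sentence.lower().split()]

passive_past = "the company was hired"
active_present = "the company is hiring"

print(toy_lemmatize_sentence(passive_past))
print(toy_lemmatize_sentence(active_present))

# Both sentences reduce to ['the', 'company', 'be', 'hire'], so tense and
# voice are no longer recoverable from the lemmatized text.
print(toy_lemmatize_sentence(passive_past) == toy_lemmatize_sentence(active_present))  # True
```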

Stick with lemmatization for:

TF-IDF vectorization pipelines
Classic ML feature engineering
Search index preprocessing
Applications requiring interpretable, clean text output
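As one illustration of the search use case, here is a minimal sketch of an inverted index keyed by lemma, so that a query matches documents containing other forms of the same words. `TOY_LEMMAS`, `normalize`, and `search` are hypothetical names, and the lookup table stands in for a real lemmatizer.

```python
from collections import defaultdict

# Hypothetical lemma table standing in for a real lemmatizer.
TOY_LEMMAS = {"geese": "goose", "flew": "fly", "flies": "fly", "flying": "fly"}

def normalize(text: str) -> list[str]:
    """Lowercase, split on whitespace, and map each token to its lemma."""
    return [TOY_LEMMAS.get(t, t) for t in text.lower().split()]

docs = {
    1: "geese flew south",
    2: "a goose flies north",
    3: "ducks swim",
}

# Build the index: lemma -> set of document ids containing it.
index = defaultdict(set)
for doc_id, text in docs.items():
    for term in normalize(text):
        index[term].add(doc_id)

def search(query: str) -> list[int]:
    """Return ids of documents containing every lemma in the query."""
    postings = [index.get(term, set()) for term in normalize(query)]
    return sorted(set.intersection(*postings)) if postings else []

print(search("flying goose"))  # [1, 2]
```

The same `normalize` function is applied at index time and at query time. That symmetry is the design point: any mismatch between the two would silently drop matches.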