Lemmatization in NLP
Lemmatization finds the dictionary form (lemma) of a word. Unlike stemming, which chops off endings using rules, lemmatization uses vocabulary knowledge to return a real, meaningful word.
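The contrast can be sketched without any NLP library. The snippet below is a toy illustration only: `toy_stem`, `toy_lemmatize`, and `TOY_LEMMA_DICT` are hypothetical names invented for this example, and real stemmers and lemmatizers are far more thorough.

```python
# Toy illustration only: hypothetical helpers, not a real stemmer or lemmatizer.

# Hypothetical mini-dictionary mapping inflected forms to lemmas.
TOY_LEMMA_DICT = {
    "studies": "study",
    "geese": "goose",
    "ran": "run",
    "running": "run",
}


def toy_stem(word: str) -> str:
    """Strip the first matching suffix, with no vocabulary check."""
    for suffix in ("ies", "ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def toy_lemmatize(word: str) -> str:
    """Look the word up; fall back to the word itself if it is unknown."""
    return TOY_LEMMA_DICT.get(word, word)


for w in ["studies", "geese", "ran", "running"]:
    print(f"{w:10} stem={toy_stem(w):8} lemma={toy_lemmatize(w)}")
```

The rule-based function produces fragments such as "stud" and "runn" and cannot touch irregular forms like "geese", while the lookup returns real words for every entry it knows.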
What Is a Lemma?
A lemma is the canonical form of a word, the form you'd find in a dictionary:
- ran → run (verb lemma)
- geese → goose (noun lemma)
- better → good (adjective lemma)
- was → be (auxiliary verb lemma)
This matters because NLP models should understand that ran, runs, and running all express the same underlying concept.
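One way to see the effect is to count tokens before and after mapping them to lemmas. This is a minimal sketch that assumes a hand-written lookup table (`TOY_LEMMAS`, a hypothetical stand-in for a real lemmatizer):

```python
from collections import Counter

# Hypothetical lookup table for illustration; a real lemmatizer replaces this.
TOY_LEMMAS = {"ran": "run", "runs": "run", "running": "run"}

tokens = ["she", "ran", "he", "runs", "they", "keep", "running"]

# Counting surface forms treats each variant as an unrelated word.
surface_counts = Counter(tokens)

# Counting lemmas pools the variants under one entry.
lemma_counts = Counter(TOY_LEMMAS.get(t, t) for t in tokens)

print(surface_counts["run"])  # 0
print(lemma_counts["run"])    # 3
```

Without the mapping, the concept "run" never appears in the counts at all. With it, the three variants add up to a single frequency.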
Why POS Tagging Matters for Lemmatization
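The same word can have different lemmas depending on its part of speech. Without a POS tag, the lemmatizer has to guess. NLTK's WordNetLemmatizer, for example, treats every word as a noun by default, so verbs and adjectives often come back unchanged.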
from nltk.stem import WordNetLemmatizer
import nltk

nltk.download('wordnet')

lem = WordNetLemmatizer()

# "meeting" as a noun vs verb
print(lem.lemmatize("meeting", pos='n'))  # → meeting (noun: a scheduled gathering)
print(lem.lemmatize("meeting", pos='v'))  # → meet (verb: the action)

# "caring" as adjective vs verb
print(lem.lemmatize("caring", pos='a'))  # → caring (adjective)
print(lem.lemmatize("caring", pos='v'))  # → care (verb)

NLTK Lemmatizer with Automatic POS Tagging
To get accurate results without manually labeling POS tags, combine NLTK's POS tagger with the lemmatizer:
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet

nltk.download('punkt_tab')  # tokenizer data needed by word_tokenize in recent NLTK releases
nltk.download('averaged_perceptron_tagger_eng')
nltk.download('wordnet')

def get_wordnet_pos(treebank_tag):
    # Map a Penn Treebank tag to the WordNet POS constant the lemmatizer expects
    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    elif treebank_tag.startswith('V'):
        return wordnet.VERB
    elif treebank_tag.startswith('N'):
        return wordnet.NOUN
    elif treebank_tag.startswith('R'):
        return wordnet.ADV
    return wordnet.NOUN  # default

lem = WordNetLemmatizer()
sentence = "The scientists were studying rapidly changing climate patterns."
tokens = nltk.word_tokenize(sentence)
tagged = nltk.pos_tag(tokens)

lemmas = [lem.lemmatize(word, get_wordnet_pos(tag)) for word, tag in tagged]
print(lemmas)
# ['The', 'scientist', 'be', 'study', 'rapidly', 'change', 'climate', 'pattern', '.']

spaCy Lemmatization (Recommended for Production)
spaCy handles POS tagging automatically, making lemmatization much simpler:
import spacy

nlp = spacy.load("en_core_web_sm")

text = "The children were happily playing in the parks near their schools."
doc = nlp(text)

for token in doc:
    if not token.is_punct and not token.is_space:
        print(f"{token.text:12} → {token.lemma_}")

# Output (abridged):
# The          → the
# children     → child
# were         → be
# happily      → happily
# playing      → play
# parks        → park
# schools      → school

Lemmatization for Different Languages
spaCy supports lemmatization for over 20 languages with language-specific models:
# German lemmatization
import spacy

nlp_de = spacy.load("de_core_news_sm")
doc_de = nlp_de("Die Kinder spielten im Garten.")
for token in doc_de:
    print(f"{token.text} → {token.lemma_}")

# French with Stanza (a separate library, shown here as an alternative to spaCy)
import stanza

stanza.download('fr')
nlp_fr = stanza.Pipeline('fr')
doc_fr = nlp_fr("Les enfants jouaient dans le jardin.")
for sentence in doc_fr.sentences:
    for word in sentence.words:
        print(f"{word.text} → {word.lemma}")

Practical Impact on Text Classification
Lemmatization can improve models that use bag-of-words or TF-IDF features because it merges word variants that carry the same meaning into a single feature, which shrinks the vocabulary and reduces sparsity:
from sklearn.feature_extraction.text import TfidfVectorizer
import spacy

nlp = spacy.load("en_core_web_sm")

def lemmatize_text(text):
    doc = nlp(text.lower())
    return " ".join([token.lemma_ for token in doc if not token.is_stop and not token.is_punct])

raw_texts = [
    "The engineers were designing innovative solutions.",
    "An engineer designs innovative solutions efficiently.",
    "Engineering innovation has reshaped modern industries."
]

lemmatized = [lemmatize_text(t) for t in raw_texts]
for orig, lem in zip(raw_texts, lemmatized):
    print(f"Original: {orig}")
    print(f"Lemmatized: {lem}\n")

vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(lemmatized)
print("Vocabulary:", vectorizer.get_feature_names_out())

When to Skip Lemmatization
With transformer-based models (BERT, RoBERTa, sentence-transformers), lemmatization is usually unnecessary. These models learn morphological relationships during pretraining and are often more accurate when given the original text. Lemmatizing before passing to a transformer may actually hurt performance by removing information the model could use.
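The information loss is easy to see with a toy example. The sketch below assumes a hand-written lemma table (`TOY_LEMMAS`) and a hypothetical helper (`toy_lemmatize_sentence`). It is not how a real lemmatizer works internally, but the collapsing effect is the same in kind.

```python
# Hypothetical lemma table for illustration only.
TOY_LEMMAS = {
    "was": "be",
    "is": "be",
    "delayed": "delay",
}


def toy_lemmatize_sentence(sentence: str) -> str:
    """Lowercase, split on whitespace, and replace each token with its lemma."""
    return " ".join(TOY_LEMMAS.get(tok, tok) for tok in sentence.lower().split())


past = "The flight was delayed"
present = "The flight is delayed"

print(toy_lemmatize_sentence(past))     # the flight be delay
print(toy_lemmatize_sentence(present))  # the flight be delay
```

Both sentences collapse to the same string, so the tense distinction is gone before the model sees the text. A bag-of-words model rarely misses that signal, but a contextual model could have used it.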
Stick with lemmatization for:
- TF-IDF vectorization pipelines
- Classic ML feature engineering
- Search index preprocessing
- Applications requiring interpretable, clean text output
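As one illustration of the search use case, the sketch below builds a tiny inverted index over lemmas, so that a query for one word form also matches documents containing another. `TOY_LEMMAS`, `normalize`, `build_index`, and `search` are hypothetical names for this example. In practice `normalize` would call a real lemmatizer such as spaCy.

```python
from collections import defaultdict

# Hypothetical lemma table standing in for a real lemmatizer.
TOY_LEMMAS = {"ran": "run", "running": "run", "runs": "run", "geese": "goose"}


def normalize(text: str) -> list[str]:
    """Lowercase, split on whitespace, and map each token to its lemma."""
    return [TOY_LEMMAS.get(tok, tok) for tok in text.lower().split()]


def build_index(docs: dict[int, str]) -> dict[str, set[int]]:
    """Map each lemma to the set of document ids that contain it."""
    index: dict[str, set[int]] = defaultdict(set)
    for doc_id, text in docs.items():
        for term in normalize(text):
            index[term].add(doc_id)
    return index


def search(index: dict[str, set[int]], query: str) -> set[int]:
    """Return ids of documents containing every query term after normalization."""
    terms = normalize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


docs = {
    1: "She ran with the geese",
    2: "He runs every morning",
    3: "A goose sat quietly",
}
index = build_index(docs)

print(sorted(search(index, "running")))  # [1, 2]
print(sorted(search(index, "goose")))    # [1, 3]
```

The same normalization is applied at index time and at query time. That is the key design choice: if only one side is lemmatized, the terms will not line up.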