Text Normalization in NLP
Text normalization converts raw, inconsistent text into a clean, standardized form before feeding it to a model or pipeline. What looks like a minor formatting difference to a human — “USA” vs “usa”, “don’t” vs “do not” — can cause significant problems for downstream models.
Why Normalization Matters
Without normalization:
- “GPT-4”, “gpt4”, “GPT 4” might be treated as three different tokens
- “I’m”, “I am”, “i am” look like different phrases to a bag-of-words model
- HTML tags, emoji, and URL noise corrupt feature representations
- Duplicate whitespace and mixed newlines create inconsistent tokenization
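As a rough illustration of the first two bullets, the sketch below counts whitespace-separated tokens with and without a naive normalizer. The `naive_normalize` helper and the sample strings are made up for this example and are not a recommended pipeline.

```python
from collections import Counter
import re

docs = ["GPT-4 is fast", "gpt4 is fast", "GPT 4 is fast"]

def naive_normalize(text):
    # Hypothetical helper: lowercase, then join "gpt", an optional "-" or space, and a digit run
    text = text.lower()
    return re.sub(r'\bgpt[\s-]?(\d+)\b', r'gpt\1', text)

raw_counts = Counter(tok for d in docs for tok in d.split())
norm_counts = Counter(tok for d in docs for tok in naive_normalize(d).split())

print(raw_counts["GPT-4"], raw_counts["gpt4"], raw_counts["GPT"])  # 1 1 1
print(norm_counts["gpt4"])                                         # 3
```

Without normalization the same model name is spread over three separate vocabulary entries; after it, all three mentions land on one.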
Core Normalization Steps
1. Lowercasing
```python
text = "Natural Language Processing and Large Language Models are Transforming AI."
normalized = text.lower()
print(normalized)
# "natural language processing and large language models are transforming ai."
```

Caution: Lowercasing loses information. “Apple” (company) and “apple” (fruit) become identical. Skip it for NER, coreference, and transformer models that encode case.
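If lowercasing is kept for case-insensitive matching on non-English text, Python's built-in `str.casefold()` is a stricter alternative to `str.lower()`. The sketch below is an aside using a German example chosen for illustration; whether it suits a given pipeline depends on the languages involved.

```python
words = ["Straße", "STRASSE"]

# lower() leaves "ß" alone, so the two spellings still differ
print([w.lower() for w in words])     # ['straße', 'strasse']

# casefold() maps "ß" to "ss", so both spellings compare equal
print([w.casefold() for w in words])  # ['strasse', 'strasse']
```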
2. Punctuation and Special Character Removal
```python
import re

def remove_punctuation(text):
    return re.sub(r'[^\w\s]', '', text)

text = "Hello, world! This is NLP... isn't it amazing?"
print(remove_punctuation(text))
# "Hello world This is NLP isnt it amazing"
```

For a lighter touch that keeps sentence-boundary periods:
```python
def clean_text(text):
    text = re.sub(r'[^a-zA-Z0-9\s.,!?]', '', text)
    return text.strip()
```

3. Whitespace Normalization
```python
import re

def normalize_whitespace(text):
    # \s+ matches runs of spaces, tabs, and newlines, so one pass collapses them all
    text = re.sub(r'\s+', ' ', text)
    return text.strip()

text = "This has \n\n extra whitespace "
print(normalize_whitespace(text))
# "This has extra whitespace"
```

4. Contraction Expansion
```python
# pip install contractions
import contractions

def expand_contractions(text):
    return contractions.fix(text)

text = "I'm not sure it's going to work, but I'll try."
print(expand_contractions(text))
# "I am not sure it is going to work, but I will try."
```

5. HTML and URL Removal
```python
from bs4 import BeautifulSoup
import re

def clean_html(text):
    soup = BeautifulSoup(text, "html.parser")
    return soup.get_text()

def remove_urls(text):
    return re.sub(r'https?://\S+|www\.\S+', '', text)

html = "<p>Check out <a href='https://example.com'>this link</a> for more info!</p>"
print(remove_urls(clean_html(html)))
# "Check out this link for more info!"
```

6. Unicode Normalization
Unicode can encode the same visible character as different code point sequences. Normalizing to a single form keeps string comparisons and vocabulary lookups consistent.
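To make this concrete, the sketch below builds “é” two ways and compares the strings before and after NFC normalization. It uses only the standard `unicodedata` module; the variable names are illustrative.

```python
import unicodedata

composed = "caf\u00e9"     # "é" as one code point (U+00E9)
decomposed = "cafe\u0301"  # "e" followed by a combining acute accent (U+0301)

print(composed == decomposed)          # False
print(len(composed), len(decomposed))  # 4 5

nfc_a = unicodedata.normalize("NFC", composed)
nfc_b = unicodedata.normalize("NFC", decomposed)
print(nfc_a == nfc_b)                  # True
```

Both strings render identically, which is why this kind of mismatch is easy to miss until a lookup or deduplication step fails.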
```python
import unicodedata

def normalize_unicode(text):
    # NFC: canonical decomposition followed by canonical composition
    return unicodedata.normalize('NFC', text)

def remove_accents(text):
    nfkd = unicodedata.normalize('NFKD', text)
    return ''.join(c for c in nfkd if not unicodedata.combining(c))

print(remove_accents("Héllo Wörld"))  # "Hello World"
print(remove_accents("café résumé"))  # "cafe resume"
```

7. Number Handling
```python
import re

# Match integers and simple decimals such as "1.2" as a single number
NUMBER = r'\b\d+(?:\.\d+)?\b'

def normalize_numbers(text, strategy='keep'):
    if strategy == 'remove':
        return re.sub(NUMBER, '', text)
    elif strategy == 'replace':
        return re.sub(NUMBER, 'NUM', text)
    return text

text = "The dataset contains 1.2 million records from 2025."
print(normalize_numbers(text, 'replace'))
# "The dataset contains NUM million records from NUM."
```

Complete Normalization Pipeline
```python
import re
import contractions
import unicodedata
from bs4 import BeautifulSoup

def normalize_text(text, lowercase=True, remove_html=True, expand=True):
    if remove_html:
        text = BeautifulSoup(text, "html.parser").get_text()

    text = re.sub(r'https?://\S+|www\.\S+', '', text)

    if expand:
        text = contractions.fix(text)

    text = unicodedata.normalize('NFC', text)

    if lowercase:
        text = text.lower()

    text = re.sub(r'[^a-z0-9\s.,!?\'"-]' if lowercase else r'[^a-zA-Z0-9\s.,!?\'"-]', '', text)
    text = re.sub(r'\s+', ' ', text).strip()

    return text

# Usage
raw = "<p>Don't miss the latest <b>AI</b> breakthrough at https://example.com! It's revolutionary.</p>"
print(normalize_text(raw))
# "do not miss the latest ai breakthrough at it is revolutionary."
# (the URL pattern's \S+ also consumes the trailing "!")
```

When to Skip Normalization Steps
| Step | Skip when |
|---|---|
| Lowercasing | Cased transformer models (e.g. cased BERT variants, GPT): their tokenizers preserve case, so it acts as a feature |
| Contraction expansion | Conversational models — contractions signal informal register |
| Punctuation removal | Sentence segmentation, grammar analysis |
| Accent removal | Multilingual tasks — accents are meaningful |
| Number removal | Financial analysis, scientific text |
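One way to act on the table is to make each step an explicit switch and keep a small set of per-task profiles. The sketch below is only an assumption about how that could look: the `PROFILES` dictionary, its task names, and the `normalize_for` helper are hypothetical, and it sticks to standard-library steps (lowercasing, accent removal, punctuation removal) to stay self-contained.

```python
import re
import unicodedata

# Hypothetical per-task switches, loosely following the table above
PROFILES = {
    "bag_of_words": {"lowercase": True, "strip_accents": True, "strip_punct": True},
    "cased_transformer": {"lowercase": False, "strip_accents": False, "strip_punct": False},
    "multilingual_search": {"lowercase": True, "strip_accents": False, "strip_punct": True},
}

def normalize_for(task, text):
    opts = PROFILES[task]
    text = unicodedata.normalize("NFC", text)
    if opts["lowercase"]:
        text = text.lower()
    if opts["strip_accents"]:
        decomposed = unicodedata.normalize("NFKD", text)
        text = "".join(c for c in decomposed if not unicodedata.combining(c))
    if opts["strip_punct"]:
        text = re.sub(r"[^\w\s]", "", text)
    # Whitespace is collapsed for every profile
    return re.sub(r"\s+", " ", text).strip()

sample = "The Café opens at 9, right?"
print(normalize_for("bag_of_words", sample))         # the cafe opens at 9 right
print(normalize_for("cased_transformer", sample))    # The Café opens at 9, right?
print(normalize_for("multilingual_search", sample))  # the café opens at 9 right
```

Keeping the switches in data rather than scattered through the code makes it easier to see, per task, which information is being discarded.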
Normalization is a preprocessing choice, not a universal requirement. Transformer-based models often perform best on minimally processed text, since their tokenizers and weights were fit to text that was not aggressively cleaned.