Text Normalization in NLP

Text normalization converts raw, inconsistent text into a clean, standardized form before feeding it to a model or pipeline. What looks like a minor formatting difference to a human — “USA” vs “usa”, “don’t” vs “do not” — can cause significant problems for downstream models.

Why Normalization Matters

Without normalization:

“GPT-4”, “gpt4”, “GPT 4” might be treated as three different tokens
“I’m”, “I am”, “i am” look like different phrases to a bag-of-words model
HTML tags, emoji, and URL noise corrupt feature representations
Duplicate whitespace and mixed newlines create inconsistent tokenization
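
A small sketch can make the first point concrete. It assumes a naive whitespace tokenizer and a made-up `toy_normalize` rule (hypothetical, for illustration only), and compares vocabulary size before and after normalization:

```python
from collections import Counter

docs = ["GPT-4 is fast", "gpt4 is fast", "GPT 4 is fast"]

def bag_of_words(text):
    # Naive whitespace tokenizer, used here only for illustration
    return Counter(text.split())

def toy_normalize(text):
    # Hypothetical rule: lowercase, then map the spelling variants to one form
    return text.lower().replace("gpt-4", "gpt4").replace("gpt 4", "gpt4")

raw_vocab = set()
norm_vocab = set()
for doc in docs:
    raw_vocab.update(bag_of_words(doc))
    norm_vocab.update(bag_of_words(toy_normalize(doc)))

print(sorted(raw_vocab))   # ['4', 'GPT', 'GPT-4', 'fast', 'gpt4', 'is']
print(sorted(norm_vocab))  # ['fast', 'gpt4', 'is']
```

The same model name occupies four vocabulary entries in the raw text and one after normalization.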

Core Normalization Steps

1. Lowercasing

text = "Natural Language Processing and Large Language Models are Transforming AI."
normalized = text.lower()
print(normalized)
# "natural language processing and large language models are transforming ai."

Caution: Lowercasing loses information. “Apple” (company) and “apple” (fruit) become identical. Skip it for NER, coreference, and transformer models that encode case.
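
When lowercasing is used purely for case-insensitive matching, Python's built-in `str.casefold()` may be a better fit than `lower()` for non-English text. The sketch below uses a made-up word list to show the difference:

```python
words = ["Straße", "STRASSE", "strasse"]

lowered = {w.lower() for w in words}
folded = {w.casefold() for w in words}

print(len(lowered))  # 2: "straße" and "strasse" stay distinct
print(len(folded))   # 1: casefold maps "ß" to "ss"
```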

2. Punctuation and Special Character Removal

import re

def remove_punctuation(text):
    return re.sub(r'[^\w\s]', '', text)

text = "Hello, world! This is NLP... isn't it amazing?"
print(remove_punctuation(text))
# "Hello world This is NLP isnt it amazing"

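For a lighter touch that keeps basic sentence punctuation (periods, commas, question marks, and exclamation marks) but still drops apostrophes and any non-ASCII characters: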

def clean_text(text):
    text = re.sub(r'[^a-zA-Z0-9\s.,!?]', '', text)
    return text.strip()
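
A quick check shows what this pattern keeps and what it drops. The sample sentence is made up for illustration; note that the accented letter disappears entirely rather than being converted to its base letter:

```python
import re

def clean_text(text):
    # Keep ASCII letters, digits, whitespace, and . , ! ? only
    text = re.sub(r'[^a-zA-Z0-9\s.,!?]', '', text)
    return text.strip()

result = clean_text("Wait... isn't café #1 (really)?")
print(result)
# "Wait... isnt caf 1 really?"
```

If accented text matters, it is probably safer to strip accents first or to widen the character class.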

3. Whitespace Normalization

def normalize_whitespace(text):
    # \s+ matches any run of spaces, tabs, and newlines, so one pass is enough
    text = re.sub(r'\s+', ' ', text)
    return text.strip()

text = "This   has  \n\n extra   whitespace  "
print(normalize_whitespace(text))
# "This has extra whitespace"
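
Collapsing everything to single spaces also erases paragraph boundaries. Where those matter, one possible variant is sketched below; the function name `normalize_whitespace_keep_paragraphs` and the sample text are assumptions for illustration, not part of the original pipeline:

```python
import re

def normalize_whitespace_keep_paragraphs(text):
    text = re.sub(r'\r\n?', '\n', text)      # CRLF or CR → LF
    text = re.sub(r'[^\S\n]+', ' ', text)    # runs of non-newline whitespace → one space
    text = re.sub(r' ?\n ?', '\n', text)     # drop spaces around line breaks
    text = re.sub(r'\n{2,}', '\n\n', text)   # two or more newlines → one blank line
    return text.strip()

sample = "First  paragraph.\r\n\r\n\r\nSecond\tparagraph,  still\nsecond.  "
cleaned = normalize_whitespace_keep_paragraphs(sample)
print(repr(cleaned))
# 'First paragraph.\n\nSecond paragraph, still\nsecond.'
```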

4. Contraction Expansion

# pip install contractions
import contractions

def expand_contractions(text):
    return contractions.fix(text)

text = "I'm not sure it's going to work, but I'll try."
print(expand_contractions(text))
# "I am not sure it is going to work, but I will try."
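
If adding a dependency is not an option, a dictionary lookup covers the common cases. The sketch below is an assumption-laden stand-in, not how the `contractions` package works internally: `CONTRACTION_MAP` is a deliberately tiny, hypothetical mapping, and ambiguous forms such as "it's" (it is / it has) are resolved to a single reading.

```python
import re

# Hypothetical, deliberately tiny mapping; a real list would be much longer
CONTRACTION_MAP = {
    "i'm": "I am",
    "it's": "it is",   # ambiguous: could also mean "it has"
    "i'll": "I will",
    "don't": "do not",
    "can't": "cannot",
}

_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(k) for k in CONTRACTION_MAP) + r")\b",
    flags=re.IGNORECASE,
)

def expand_contractions_simple(text):
    def _repl(match):
        found = match.group(0)
        expanded = CONTRACTION_MAP[found.lower()]
        # Preserve a leading capital, e.g. "Don't" → "Do not"
        if found[0].isupper():
            expanded = expanded[0].upper() + expanded[1:]
        return expanded

    # Map curly apostrophes to straight ones so both spellings match
    return _PATTERN.sub(_repl, text.replace("\u2019", "'"))

expanded = expand_contractions_simple("I'm not sure it's going to work, but I'll try.")
print(expanded)
# "I am not sure it is going to work, but I will try."
```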

5. HTML and URL Removal

from bs4 import BeautifulSoup
import re

def clean_html(text):
    soup = BeautifulSoup(text, "html.parser")
    return soup.get_text()

def remove_urls(text):
    return re.sub(r'https?://\S+|www\.\S+', '', text)

html = "<p>Check out <a href='https://example.com'>this link</a> for more info!</p>"
print(remove_urls(clean_html(html)))
# "Check out this link for more info!"
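
In the example above the URL sits in an `href` attribute, so `get_text()` already discards it. When a URL appears in the visible text, `\S+` runs to the next whitespace and removes any punctuation glued to the URL as well. A possible workaround, sketched here with a hypothetical helper `remove_urls_keep_punct`, hands the trailing punctuation back:

```python
import re

URL_RE = re.compile(r'https?://\S+|www\.\S+')
TRAILING = '.,!?;:)'  # assumed set of characters that usually end a sentence, not a URL

def remove_urls_keep_punct(text):
    def _repl(match):
        url = match.group(0)
        stripped = url.rstrip(TRAILING)
        # Return only the punctuation that followed the URL
        return url[len(stripped):]
    return URL_RE.sub(_repl, text)

cleaned = remove_urls_keep_punct("Read more at https://example.com! It helps.")
print(cleaned)
# "Read more at ! It helps."
```

The leftover space before "!" is a cosmetic issue that a later cleanup pass would need to handle if it matters.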

6. Unicode Normalization

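Unicode can encode the same visible character in more than one way. For example, "é" can be a single code point (U+00E9) or the letter "e" followed by a combining acute accent (U+0301). The two render identically but compare as different strings, so normalizing to one form ensures consistency: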

import unicodedata

def normalize_unicode(text):
    # NFC: canonical decomposition followed by canonical composition
    return unicodedata.normalize('NFC', text)

def remove_accents(text):
    nfkd = unicodedata.normalize('NFKD', text)
    return ''.join(c for c in nfkd if not unicodedata.combining(c))

print(remove_accents("Héllo Wörld"))   # "Hello World"
print(remove_accents("café résumé"))    # "cafe resume"
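
The effect of NFC is easiest to see by comparing two encodings of the same word. This is a minimal sketch using only the standard library; the strings are written with escapes so the two forms are unambiguous:

```python
import unicodedata

composed = "caf\u00e9"      # "é" as a single code point
decomposed = "cafe\u0301"   # "e" followed by a combining acute accent

print(composed == decomposed)          # False
print(len(composed), len(decomposed))  # 4 5

nfc = unicodedata.normalize('NFC', decomposed)
print(nfc == composed)                 # True
```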

7. Number Handling

def normalize_numbers(text, strategy='keep'):
    # Match integers and simple decimals such as "1.2" as a single number
    number = r'\b\d+(?:\.\d+)?\b'
    if strategy == 'remove':
        # Leaves extra spaces behind; run whitespace normalization afterwards
        return re.sub(number, '', text)
    elif strategy == 'replace':
        return re.sub(number, 'NUM', text)
    return text

text = "The dataset contains 1.2 million records from 2025."
print(normalize_numbers(text, 'replace'))
# "The dataset contains NUM million records from NUM."

Complete Normalization Pipeline

import re
import contractions
import unicodedata
from bs4 import BeautifulSoup

def normalize_text(text, lowercase=True, remove_html=True, expand=True):
    if remove_html:
        text = BeautifulSoup(text, "html.parser").get_text()

    text = re.sub(r'https?://\S+|www\.\S+', '', text)

    if expand:
        text = contractions.fix(text)

    text = unicodedata.normalize('NFC', text)

    if lowercase:
        text = text.lower()

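    # Keep ASCII letters, digits, whitespace, and basic punctuation.
    # Note: this also drops accented and other non-ASCII characters.
    disallowed = r'[^a-z0-9\s.,!?\'"-]' if lowercase else r'[^a-zA-Z0-9\s.,!?\'"-]'
    text = re.sub(disallowed, '', text)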
    text = re.sub(r'\s+', ' ', text).strip()

    return text

# Usage
raw = "<p>Don't miss the latest <b>AI</b> breakthrough at https://example.com! It's revolutionary.</p>"
print(normalize_text(raw))
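# "do not miss the latest ai breakthrough at it is revolutionary."
# The URL pattern's \S+ also consumes the "!" attached to the URL,
# so the sentence boundary after "at" is lost.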

When to Skip Normalization Steps

Step	Skip when
Lowercasing	Cased transformer models (e.g. cased BERT, GPT): their tokenizers keep case, so the model can use it as a feature
Contraction expansion	Conversational models — contractions signal informal register
Punctuation removal	Sentence segmentation, grammar analysis
Accent removal	Multilingual tasks — accents are meaningful
Number removal	Financial analysis, scientific text
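
One way to act on this table is to make each step an explicit, per-task switch. The sketch below is standard-library only and rests on assumptions: the `PRESETS` dictionary, its task names, and `normalize_for` are hypothetical and chosen for illustration.

```python
import re
import unicodedata

# Hypothetical presets; the task names and flag choices are illustrative only
PRESETS = {
    "bag_of_words": {"lowercase": True, "strip_punct": True, "strip_accents": True},
    "ner": {"lowercase": False, "strip_punct": False, "strip_accents": False},
}

def normalize_for(task, text):
    opts = PRESETS[task]
    text = unicodedata.normalize('NFC', text)
    if opts["strip_accents"]:
        decomposed = unicodedata.normalize('NFKD', text)
        text = ''.join(c for c in decomposed if not unicodedata.combining(c))
    if opts["lowercase"]:
        text = text.lower()
    if opts["strip_punct"]:
        text = re.sub(r'[^\w\s]', '', text)
    return re.sub(r'\s+', ' ', text).strip()

sample = "Apple opened a café in Zürich!"
print(normalize_for("bag_of_words", sample))  # "apple opened a cafe in zurich"
print(normalize_for("ner", sample))           # "Apple opened a café in Zürich!"
```

Keeping the switches in one place makes it easier to see, and to review, which information each task throws away.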

Normalization is a preprocessing choice, not a universal requirement. Transformer-based models often perform best on minimally processed text, since their tokenizers and pretraining corpora were built on text that was largely left unnormalized.