💬 Natural Language Processing 40 guides · updated 2026

From tokenisation and embeddings to transformer-based language understanding — the NLP fundamentals that underpin every modern LLM.

Text Normalization in NLP

Text normalization converts raw, inconsistent text into a clean, standardized form before feeding it to a model or pipeline. What looks like a minor formatting difference to a human — “USA” vs “usa”, “don’t” vs “do not” — can cause significant problems for downstream models.


Why Normalization Matters

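Without normalization, a model treats surface variants such as “USA” and “usa” as unrelated tokens, which inflates the vocabulary and splits the statistics for what is really one word across several entries.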
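A small sketch of the effect, using a hypothetical three-sentence corpus (the `docs` list and the `vocab` helper are illustrative and not part of any library). It counts distinct whitespace-separated tokens with and without a crude normalization pass:

```python
from collections import Counter

# Hypothetical toy corpus: the same sentence with inconsistent casing and punctuation
docs = ["The USA team won.", "the usa team won", "THE USA TEAM WON!"]

def vocab(texts, normalize=False):
    counts = Counter()
    for text in texts:
        if normalize:
            # Lowercase, then keep only letters, digits and whitespace
            text = "".join(ch for ch in text.lower() if ch.isalnum() or ch.isspace())
        counts.update(text.split())
    return counts

raw_vocab = vocab(docs)
clean_vocab = vocab(docs, normalize=True)
print(len(raw_vocab), len(clean_vocab))  # 10 4
```

Ten distinct tokens collapse to four, and each surviving entry now carries the counts of all its variants.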


Core Normalization Steps

1. Lowercasing

text = "Natural Language Processing and Large Language Models are Transforming AI."
normalized = text.lower()
print(normalized)
# "natural language processing and large language models are transforming ai."

Caution: Lowercasing loses information. “Apple” (company) and “apple” (fruit) become identical. Skip it for NER, coreference resolution, and cased transformer models, whose tokenizers keep case as a signal.
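When the goal is only case-insensitive matching, one option is to compare folded keys and leave the stored text untouched. The sketch below uses Python's built-in `str.casefold()`; the helper name `same_ignoring_case` is made up for illustration.

```python
def same_ignoring_case(a, b):
    # casefold() is a more aggressive lower() intended for caseless matching
    return a.casefold() == b.casefold()

print(same_ignoring_case("USA", "usa"))          # True
print("Straße".lower() == "STRASSE".lower())     # False: lower() keeps the ß
print(same_ignoring_case("Straße", "STRASSE"))   # True: casefold() maps ß to ss
```

The original strings are never modified, so a downstream NER step would still see “Apple” with its capital letter.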


2. Punctuation and Special Character Removal

import re

def remove_punctuation(text):
    return re.sub(r'[^\w\s]', '', text)

text = "Hello, world! This is NLP... isn't it amazing?"
print(remove_punctuation(text))
# "Hello world This is NLP isnt it amazing"

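For a lighter touch that keeps basic sentence punctuation (periods, commas, exclamation marks, and question marks):

def clean_text(text):
    # ASCII whitelist: anything outside these characters, including accented letters, is dropped
    text = re.sub(r'[^a-zA-Z0-9\s.,!?]', '', text)
    return text.strip()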
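The two patterns above differ in more than punctuation handling. As a quick check, the sketch below applies both to one hypothetical sample string: `\w` is Unicode-aware for `str` patterns and also matches the underscore, while the ASCII whitelist removes accented letters and underscores.

```python
import re

sample = "caf\u00e9_au_lait, na\u00efve!"   # "café_au_lait, naïve!"

# \w keeps Unicode letters and "_"; only the comma and "!" are removed
unicode_aware = re.sub(r'[^\w\s]', '', sample)
# ASCII whitelist keeps ",", "!" but drops "é", "ï" and "_"
ascii_only = re.sub(r'[^a-zA-Z0-9\s.,!?]', '', sample)

print(unicode_aware)  # café_au_lait naïve
print(ascii_only)     # cafaulait, nave!
```

For non-English text the ASCII whitelist silently damages words, so it is worth choosing the pattern with the target language in mind.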

3. Whitespace Normalization

import re

def normalize_whitespace(text):
    # \s+ matches any whitespace run (spaces, tabs, newlines), so one pass is enough
    text = re.sub(r'\s+', ' ', text)
    return text.strip()

text = "This has \n\n extra whitespace "
print(normalize_whitespace(text))
# "This has extra whitespace"
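Collapsing every newline also erases paragraph boundaries, which some tasks need. A possible variant, assuming blank lines mark paragraphs (the function name is illustrative), keeps those boundaries and cleans whitespace inside each paragraph:

```python
import re

def normalize_whitespace_keep_paragraphs(text):
    # Treat one or more blank lines as a paragraph boundary
    paragraphs = re.split(r'\n\s*\n', text.strip())
    # str.split() with no arguments splits on any run of whitespace
    cleaned = [' '.join(p.split()) for p in paragraphs]
    # Re-join with a single blank line, skipping paragraphs that ended up empty
    return '\n\n'.join(p for p in cleaned if p)

text = "First  line\nstill first.\n\n\nSecond\tparagraph. "
print(normalize_whitespace_keep_paragraphs(text))
```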

4. Contraction Expansion

# pip install contractions
import contractions

def expand_contractions(text):
    return contractions.fix(text)

text = "I'm not sure it's going to work, but I'll try."
print(expand_contractions(text))
# "I am not sure it is going to work, but I will try."
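Where an extra dependency is not wanted, a lookup table gives a rough equivalent. This is a minimal sketch with a deliberately tiny, hypothetical `CONTRACTIONS` table, not a description of how the `contractions` package works internally. It also cannot resolve ambiguous forms: “it's” is always expanded to “it is”, never “it has”.

```python
import re

# Hypothetical, deliberately incomplete mapping for illustration
CONTRACTIONS = {
    "i'm": "I am",
    "it's": "it is",
    "i'll": "I will",
    "don't": "do not",
    "can't": "cannot",
}
_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(k) for k in CONTRACTIONS) + r")\b",
    re.IGNORECASE,
)

def expand_with_table(text):
    # Map the typographic apostrophe to the ASCII one so the table keys match
    text = text.replace("\u2019", "'")

    def _replace(match):
        word = match.group(0)
        expanded = CONTRACTIONS[word.lower()]
        # Preserve sentence-initial capitalisation
        if word[0].isupper():
            expanded = expanded[0].upper() + expanded[1:]
        return expanded

    return _PATTERN.sub(_replace, text)

print(expand_with_table("Don\u2019t worry, it's fine and I'm ready."))
# "Do not worry, it is fine and I am ready."
```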

5. HTML and URL Removal

from bs4 import BeautifulSoup
import re

def clean_html(text):
    soup = BeautifulSoup(text, "html.parser")
    return soup.get_text()

def remove_urls(text):
    return re.sub(r'https?://\S+|www\.\S+', '', text)

html = "<p>Check out <a href='https://example.com'>this link</a> for more info!</p>"
# The href URL disappears with the <a> tag; remove_urls only affects URLs in the visible text
print(remove_urls(clean_html(html)))
# "Check out this link for more info!"
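A pattern ending in `\S+` also consumes punctuation that directly follows a URL, so a sentence-final “!” or “.” is lost along with it. One possible refinement, sketched here with a hypothetical `replace_urls` helper and an assumed `URL` placeholder token, hands trailing punctuation back to the sentence:

```python
import re

_URL = re.compile(r'(?:https?://|www\.)\S+')
# Characters treated as sentence punctuation when they end a match (an assumption)
_TRAILING = '.,!?;:)"\''

def replace_urls(text, placeholder='URL'):
    def _swap(match):
        url = match.group(0)
        core = url.rstrip(_TRAILING)
        # Keep whatever punctuation was stripped from the end of the match
        return placeholder + url[len(core):]
    return _URL.sub(_swap, text)

print(replace_urls("Read the post at https://example.com! It is short."))
# "Read the post at URL! It is short."
```

The trade-off is that URLs which legitimately end in one of these characters, such as a closing parenthesis, are truncated by one character before replacement.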

6. Unicode Normalization

Unicode represents some characters in multiple ways. Normalizing ensures consistency:

import unicodedata

def normalize_unicode(text):
    # NFC: canonical decomposition followed by canonical composition
    return unicodedata.normalize('NFC', text)

def remove_accents(text):
    nfkd = unicodedata.normalize('NFKD', text)
    return ''.join(c for c in nfkd if not unicodedata.combining(c))

print(remove_accents("Héllo Wörld")) # "Hello World"
print(remove_accents("café résumé")) # "cafe resume"
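To make the "multiple ways" point concrete, the sketch below builds the same visible word from two different code point sequences and shows that they only compare equal after normalization. The last line illustrates the compatibility forms (NFKC/NFKD), which additionally fold characters such as the “ﬁ” ligature.

```python
import unicodedata

composed = "caf\u00e9"      # é as a single code point
decomposed = "cafe\u0301"   # e followed by a combining acute accent

print(composed == decomposed)    # False: same rendering, different code points
print(len(composed), len(decomposed))  # 4 5

nfc_a = unicodedata.normalize('NFC', composed)
nfc_b = unicodedata.normalize('NFC', decomposed)
print(nfc_a == nfc_b)            # True

# Compatibility normalization also replaces the U+FB01 ligature with "fi"
print(unicodedata.normalize('NFKC', "\ufb01le"))  # file
```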

7. Number Handling

import re

# Integers and decimals such as 2025 or 1.2
NUMBER = r'\b\d+(?:\.\d+)?\b'

def normalize_numbers(text, strategy='keep'):
    if strategy == 'remove':
        return re.sub(NUMBER, '', text)
    elif strategy == 'replace':
        return re.sub(NUMBER, 'NUM', text)
    return text

text = "The dataset contains 1.2 million records from 2025."
print(normalize_numbers(text, 'replace'))
# "The dataset contains NUM million records from NUM."
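Real text often writes numbers with thousands separators, which a plain digit pattern splits into pieces. The following sketch assumes English-style formatting (comma for thousands, period for decimals); the `replace_numbers` name and the pattern are illustrative choices, and other locales would need a different pattern.

```python
import re

# First alternative: grouped digits such as 1,250,000 (optionally with decimals)
# Second alternative: plain integers and decimals such as 2025 or 12.5
_NUMBER = re.compile(r'\b\d{1,3}(?:,\d{3})+(?:\.\d+)?\b|\b\d+(?:\.\d+)?\b')

def replace_numbers(text, token='NUM'):
    return _NUMBER.sub(token, text)

print(replace_numbers("Revenue rose 12.5% to $1,250,000 in 2025."))
# "Revenue rose NUM% to $NUM in NUM."
```

Currency symbols and percent signs are left in place, which keeps a hint about what kind of quantity the placeholder stands for.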

Complete Normalization Pipeline

import re
import contractions
import unicodedata
from bs4 import BeautifulSoup

def normalize_text(text, lowercase=True, remove_html=True, expand=True):
    if remove_html:
        text = BeautifulSoup(text, "html.parser").get_text()
    # \S+ also consumes punctuation attached to the URL, such as a trailing "!"
    text = re.sub(r'https?://\S+|www\.\S+', '', text)
    if expand:
        text = contractions.fix(text)
    text = unicodedata.normalize('NFC', text)
    if lowercase:
        text = text.lower()
    # ASCII whitelist: accented and other non-ASCII letters are dropped here
    text = re.sub(r'[^a-z0-9\s.,!?\'"-]' if lowercase else r'[^a-zA-Z0-9\s.,!?\'"-]', '', text)
    text = re.sub(r'\s+', ' ', text).strip()
    return text

# Usage
raw = "<p>Don't miss the latest <b>AI</b> breakthrough at https://example.com! It's revolutionary.</p>"
print(normalize_text(raw))
# "do not miss the latest ai breakthrough at it is revolutionary."
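Because the right set of steps depends on the task, it can help to treat each step as a small function and assemble pipelines from them. The sketch below uses only the standard library so it runs without `bs4` or `contractions`; `make_pipeline` and the step names are hypothetical helpers, not an established API.

```python
import re
import unicodedata

def strip_urls(text):
    return re.sub(r'https?://\S+|www\.\S+', '', text)

def to_nfc(text):
    return unicodedata.normalize('NFC', text)

def collapse_whitespace(text):
    return ' '.join(text.split())

def make_pipeline(*steps):
    # Returns a function that applies each step in the order given
    def run(text):
        for step in steps:
            text = step(text)
        return text
    return run

# A heavier pipeline for bag-of-words style models, a lighter one for cased transformers
classic = make_pipeline(strip_urls, to_nfc, str.lower, collapse_whitespace)
minimal = make_pipeline(to_nfc, collapse_whitespace)

sample = "Visit  https://example.com  for the LATEST   news"
print(classic(sample))  # "visit for the latest news"
print(minimal(sample))  # "Visit https://example.com for the LATEST news"
```

Adding or dropping a step is then a one-line change, and the order of steps stays explicit.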

When to Skip Normalization Steps

Step | Skip when
Lowercasing | Cased transformer models (for example bert-base-cased or GPT-style models), whose tokenizers treat case as a feature
Contraction expansion | Conversational models, where contractions signal informal register
Punctuation removal | Sentence segmentation, grammar analysis
Accent removal | Multilingual tasks, where accents are meaningful
Number removal | Financial analysis, scientific text

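Normalization is a preprocessing choice, not a universal requirement. Transformer-based models often perform best on minimally processed text: they were pretrained on largely raw web data and their tokenizers apply their own normalization, so extra cleaning can move the input away from what the model saw during pretraining.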