Stopword Removal in NLP
Stopwords are high-frequency words that carry little semantic weight on their own — “the”, “is”, “and”, “at”, “which”. Removing them reduces noise in classic NLP pipelines. But the decision isn’t always straightforward.
What Are Stopwords?
Stopwords are words that appear so often across all texts that they don’t help distinguish one document from another. In a topic classification task, “the” appears in every article equally, contributing no signal.
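To make the "no signal" point concrete, here is a minimal sketch of inverse document frequency (IDF) on a toy corpus. The three documents and the helper name `inverse_document_frequency` are assumptions made up for illustration, and the formula is the plain `log(N / df)` variant rather than any particular library's smoothed version.

```python
import math

# Toy corpus; the documents and the helper name are illustrative assumptions.
docs = [
    "the cat sat on the mat",
    "the stock market fell sharply",
    "the team won the match",
]

def inverse_document_frequency(term, documents):
    """Plain IDF: log(N / df), where df counts the documents containing the term."""
    df = sum(1 for doc in documents if term in doc.split())
    # Terms that never occur get 0.0 here purely to keep the sketch simple.
    return math.log(len(documents) / df) if df else 0.0

for term in ["the", "market", "cat"]:
    print(term, round(inverse_document_frequency(term, docs), 3))
# the 0.0
# market 1.099
# cat 1.099
```

A word that occurs in every document gets an IDF of zero, so it contributes nothing when documents are compared, which is the intuition behind dropping it up front.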
NLTK ships a ready-made English list:

    import nltk
    nltk.download('stopwords')
    from nltk.corpus import stopwords

    english_stops = stopwords.words('english')
    print(english_stops[:20])
    # ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you',
    #  "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself',
    #  'yourselves', 'he', 'him', 'his']

    # 179 at the time of writing; the count can vary with the NLTK data version
    print(f"Total English stopwords in NLTK: {len(english_stops)}")

Stopword Removal with NLTK
    import nltk
    from nltk.corpus import stopwords
    from nltk.tokenize import word_tokenize

    nltk.download('punkt_tab')

    stop_words = set(stopwords.words('english'))

    text = "The latest AI models are transforming how we process and understand natural language."
    tokens = word_tokenize(text.lower())
    filtered = [word for word in tokens if word.isalpha() and word not in stop_words]

    print("Original tokens:", tokens)
    print("After removal: ", filtered)
    # After removal: ['latest', 'ai', 'models', 'transforming', 'process', 'understand', 'natural', 'language']

Stopword Removal with spaCy
spaCy marks stopwords via token.is_stop:
    import spacy

    nlp = spacy.load("en_core_web_sm")

    text = "The researchers published a groundbreaking study on large language models."
    doc = nlp(text)

    content_words = [token.text for token in doc if not token.is_stop and not token.is_punct]
    print(content_words)
    # ['researchers', 'published', 'groundbreaking', 'study', 'large', 'language', 'models']

Adding Custom Stopwords
Domain-specific text often contains words that are common but meaningless in context — journal names, legal boilerplate, company names appearing in every document:
    import spacy

    nlp = spacy.load("en_core_web_sm")

    # Add custom stopwords for a legal document pipeline.
    # Vocabulary entries are keyed by exact string, so this flags the lowercase forms only.
    custom_stops = {"hereby", "herein", "thereof", "aforementioned", "pursuant"}
    for word in custom_stops:
        nlp.vocab[word].is_stop = True

    # Or remove a default stopword that matters in your domain
    nlp.vocab["not"].is_stop = False  # "not" can be critical in sentiment analysis!

Multilingual Stopwords
    from nltk.corpus import stopwords
    from nltk.tokenize import word_tokenize

    # NLTK ships stopword lists for over 20 languages (the exact set depends on
    # the data version); stopwords.fileids() lists the ones installed.
    languages = ['english', 'french', 'german', 'spanish', 'portuguese',
                 'italian', 'dutch', 'arabic', 'chinese']
    for lang in languages:
        stops = stopwords.words(lang)
        print(f"{lang:12}: {len(stops)} stopwords")

    # Spanish example
    spanish_stops = set(stopwords.words('spanish'))
    text_es = "Los modelos de lenguaje han transformado el procesamiento del texto."
    tokens_es = word_tokenize(text_es.lower())
    filtered_es = [w for w in tokens_es if w.isalpha() and w not in spanish_stops]
    print(filtered_es)
    # ['modelos', 'lenguaje', 'transformado', 'procesamiento', 'texto']
    # ("han" is a form of "haber" and is on NLTK's Spanish list, so it is removed too)

When NOT to Remove Stopwords
This is where most tutorials go wrong. Stopword removal is not universally beneficial:
- Sentiment analysis: “not good” becomes “good” once “not” is removed, so the negation is lost.
- Question answering: “Who is the president of France?” reduces to “president France”, which no longer says what is being asked.
- Machine translation: grammatical function words are essential for producing correct syntax in the target language.
- Transformer models: BERT, GPT, and similar models were pretrained on full text, stopwords included. Stripping stopwords from the input gives the model text unlike what it was trained on, which typically degrades performance.
- Named entity recognition: removing “the” from “the United Nations” leaves the name itself intact, but missing function words can still affect chunking.
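The sentiment case is easy to reproduce without any library. The sketch below uses a small hand-picked stopword set (an assumption for illustration, not NLTK's or spaCy's list) and a hypothetical `remove_stopwords` helper with a `keep` parameter that lets negation words pass through.

```python
# Small hand-picked stopword set for illustration (not NLTK's or spaCy's list).
STOP_WORDS = {"the", "was", "is", "a", "not", "no", "this", "at", "all"}
NEGATIONS = {"not", "no", "never"}

def remove_stopwords(text, stop_words, keep=frozenset()):
    """Drop stopwords from lowercased, whitespace-split text, except words in `keep`."""
    return [w for w in text.lower().split() if w not in stop_words or w in keep]

review = "The movie was not good"
print(remove_stopwords(review, STOP_WORDS))                  # ['movie', 'good']
print(remove_stopwords(review, STOP_WORDS, keep=NEGATIONS))  # ['movie', 'not', 'good']
```

If a pipeline needs stopword removal but also feeds a sentiment model, keeping an explicit allow-list of negations like this is one possible compromise.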
Stopwords in Modern NLP Pipelines
In 2025, stopword removal is primarily used in:
- TF-IDF + classical ML classifiers where vocabulary size matters
- Search index building to reduce index size and match on content words
- Topic modeling (LDA, NMF) to improve topic quality
- Text summarization preprocessing for extractive methods
If you’re using transformer-based models end-to-end, skip this step.
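As a rough illustration of the search-index case, here is a minimal inverted index built with and without stopword filtering. The stopword set, the documents, and the `build_index` helper are all assumptions for this sketch; a real search engine does considerably more (tokenization, stemming, ranking).

```python
from collections import defaultdict

# Illustrative stopword set (an assumption for this sketch).
STOP_WORDS = {"the", "of", "a", "to", "and", "in", "is", "for"}

def build_index(documents, stop_words=frozenset()):
    """Map each term to the sorted list of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(documents):
        for term in text.lower().split():
            if term not in stop_words:
                index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

documents = [
    "the history of the printing press",
    "a guide to the history of jazz",
    "printing in the age of jazz",
]

full = build_index(documents)
filtered = build_index(documents, STOP_WORDS)
print(len(full), len(filtered))  # 11 6
print(filtered["history"])       # [0, 1]
```

The filtered index is smaller and still answers content-word queries. The trade-off is that a query made up entirely of stopwords can no longer be matched.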
Complete Pipeline Example
    import spacy

    nlp = spacy.load("en_core_web_sm")

    def extract_content_words(text):
        """Return lemmas of alphabetic tokens that are not stopwords."""
        doc = nlp(text.lower())
        return [
            token.lemma_
            for token in doc
            if not token.is_stop
            and not token.is_punct
            and not token.is_space
            and token.is_alpha
        ]

    documents = [
        "Natural language processing enables machines to understand human text.",
        "Deep learning models process text more effectively than rule-based systems.",
        "Transformers revolutionized natural language understanding tasks.",
    ]

    for doc_text in documents:
        words = extract_content_words(doc_text)
        print(words)

    # Output (exact lemmas can vary with the spaCy model version):
    # ['natural', 'language', 'processing', 'enable', 'machine', 'understand', 'human', 'text']
    # ['deep', 'learn', 'model', 'process', 'text', 'effectively', 'rule', 'base', 'system']
    # ['transformer', 'revolutionize', 'natural', 'language', 'understand', 'task']
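One natural next step is to count the extracted content words across documents. The sketch below hardcodes the three token lists from the example output above so it runs without spaCy; with a different model version the lemmas, and therefore the counts, may differ.

```python
from collections import Counter

# Token lists copied from the example output above; real lemmas may vary by model.
content_words = [
    ["natural", "language", "processing", "enable", "machine", "understand", "human", "text"],
    ["deep", "learn", "model", "process", "text", "effectively", "rule", "base", "system"],
    ["transformer", "revolutionize", "natural", "language", "understand", "task"],
]

# Flatten the per-document lists and count how often each lemma occurs overall.
term_counts = Counter(word for words in content_words for word in words)

for term, count in term_counts.most_common(4):
    print(f"{term}: {count}")
```

With stopwords already removed, the most frequent terms are content words ("natural", "language", "understand", "text") rather than "the" or "to", which is what makes these counts usable as features for TF-IDF or topic modeling.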