Stopword Removal in NLP

Stopwords are high-frequency words that carry little semantic weight on their own — “the”, “is”, “and”, “at”, “which”. Removing them reduces noise in classic NLP pipelines. But the decision isn’t always straightforward.

What Are Stopwords?

Stopwords are words that appear so often across all texts that they don’t help distinguish one document from another. In a topic classification task, “the” appears at a similar rate in every article regardless of topic, so it contributes no signal.

import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords

english_stops = stopwords.words('english')
print(english_stops[:20])
# ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you',
#  "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself',
#  'yourselves', 'he', 'him', 'his']
print(f"Total English stopwords in NLTK: {len(english_stops)}")  # e.g. 179; the count varies with the corpus version
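To make the "no signal" claim concrete, here is a minimal sketch that computes inverse document frequency (idf = log(N / df)) over a toy corpus. The corpus and the helper name `inverse_document_frequency` are made up for illustration; real TF-IDF implementations usually add smoothing on top of this formula.

```python
import math
from collections import Counter

# Toy corpus (hypothetical); each document is already lowercased.
docs = [
    "the cat sat on the mat",
    "the stock market fell sharply",
    "the team won the final match",
]

def inverse_document_frequency(documents):
    """Return idf = log(N / df) for every word in the corpus (no smoothing)."""
    n_docs = len(documents)
    doc_freq = Counter()
    for text in documents:
        doc_freq.update(set(text.split()))  # count each word once per document
    return {word: math.log(n_docs / df) for word, df in doc_freq.items()}

idf = inverse_document_frequency(docs)
print(idf["the"])     # 0.0: present in every document, so it separates nothing
print(idf["market"])  # log(3 / 1), roughly 1.0986
```

A word that occurs in every document gets an idf of exactly zero under this formula, which is the numerical version of "contributing no signal".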

Stopword Removal with NLTK

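import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

nltk.download('stopwords')  # stopword lists
nltk.download('punkt_tab')  # tokenizer data for word_tokenize (older NLTK releases use 'punkt')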

stop_words = set(stopwords.words('english'))

text = "The latest AI models are transforming how we process and understand natural language."
tokens = word_tokenize(text.lower())
filtered = [word for word in tokens if word.isalpha() and word not in stop_words]

print("Original tokens:", tokens)
print("After removal:  ", filtered)
# After removal: ['latest', 'ai', 'models', 'transforming', 'process', 'understand', 'natural', 'language']

Stopword Removal with spaCy

spaCy marks stopwords via token.is_stop:

import spacy
nlp = spacy.load("en_core_web_sm")

text = "The researchers published a groundbreaking study on large language models."
doc = nlp(text)

content_words = [token.text for token in doc
                 if not token.is_stop and not token.is_punct]
print(content_words)
# ['researchers', 'published', 'groundbreaking', 'study', 'large', 'language', 'models']

Adding Custom Stopwords

Domain-specific text often contains words that are common but meaningless in context — journal names, legal boilerplate, company names appearing in every document:

import spacy
nlp = spacy.load("en_core_web_sm")

# Add custom stopwords for a legal document pipeline
custom_stops = {"hereby", "herein", "thereof", "aforementioned", "pursuant"}
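for word in custom_stops:
    nlp.vocab[word].is_stop = True      # sets the flag on this exact lexeme
    nlp.Defaults.stop_words.add(word)   # also add to the language defaults, which are checked case-insensitively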

# Or remove a default stopword that matters in your domain
nlp.vocab["not"].is_stop = False   # "not" can be critical in sentiment analysis!
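The same add-and-remove idea works with a plain Python set, for example one built from a library's stopword list. The sketch below is library-agnostic: `BASE_STOPS` is a tiny hypothetical stand-in for a real default list, and `build_stopwords` is an illustrative helper, not part of NLTK or spaCy.

```python
# Hypothetical base list standing in for a library's default stopwords.
BASE_STOPS = frozenset({"the", "is", "and", "of", "not", "to", "in"})

def build_stopwords(base, add=(), keep=()):
    """Return base plus domain-specific additions, minus words that must survive."""
    return (set(base) | {w.lower() for w in add}) - {w.lower() for w in keep}

legal_stops = build_stopwords(
    BASE_STOPS,
    add={"hereby", "herein", "thereof"},  # legal boilerplate to drop
    keep={"not"},                          # negation must not be filtered out
)

tokens = "the tenant is hereby not liable".split()
print([t for t in tokens if t not in legal_stops])
# ['tenant', 'not', 'liable']
```

Keeping the customization in one function makes it easy to version the stopword list alongside the rest of the pipeline configuration.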

Multilingual Stopwords

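from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# NLTK ships stopword lists for more than 20 languages. The exact set depends
# on the installed corpus version; stopwords.fileids() lists what is available.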
languages = ['english', 'french', 'german', 'spanish', 'portuguese',
             'italian', 'dutch', 'arabic', 'chinese']
for lang in languages:
    stops = stopwords.words(lang)
    print(f"{lang:12}: {len(stops)} stopwords")

# Spanish example
spanish_stops = set(stopwords.words('spanish'))
text_es = "Los modelos de lenguaje han transformado el procesamiento del texto."
tokens_es = word_tokenize(text_es.lower())
filtered_es = [w for w in tokens_es if w.isalpha() and w not in spanish_stops]
print(filtered_es)
# ['modelos', 'lenguaje', 'transformado', 'procesamiento', 'texto']
# ('han', a form of the auxiliary 'haber', is in NLTK's Spanish list)

When NOT to Remove Stopwords

This is where most tutorials go wrong. Stopword removal is not universally beneficial:

Sentiment analysis: “not good” becomes “good” without stopwords. The negation is lost.

Question answering: “Who is the president of France?” loses meaning as “president France”.

Machine translation: grammatical function words are essential for correct syntax in the target language.

Transformer models: BERT, GPT, and similar models were pretrained on full text including stopwords. Removing them strips context the model expects and tends to degrade performance.

Named entity recognition: dropping “the” from “the United Nations” looks harmless, but removing function words can shift entity boundaries and affect chunking.
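The sentiment case is easy to reproduce. The sketch below uses a small hypothetical stopword set (default lists in NLTK and spaCy also contain "not") and an illustrative `remove_stopwords` helper with an optional `keep` set that protects negation words.

```python
# Small hypothetical stopword list; real default lists also include "not".
STOPS = {"the", "was", "is", "a", "not", "no"}
NEGATIONS = {"not", "no", "never"}

def remove_stopwords(tokens, stops, keep=frozenset()):
    """Drop tokens found in `stops`, unless they are protected by `keep`."""
    return [t for t in tokens if t not in stops or t in keep]

review = "the movie was not good".split()

print(remove_stopwords(review, STOPS))
# ['movie', 'good']  (now reads as a positive review)
print(remove_stopwords(review, STOPS, keep=NEGATIONS))
# ['movie', 'not', 'good']
```

If stopword removal is still wanted for a sentiment pipeline, whitelisting negations like this is one way to limit the damage.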

Stopwords in Modern NLP Pipelines

In 2025, stopword removal is primarily used in:

TF-IDF + classical ML classifiers where vocabulary size matters
Search index building to reduce index size and match on content words
Topic modeling (LDA, NMF) to improve topic quality
Text summarization preprocessing for extractive methods

If you’re using transformer-based models end-to-end, skip this step.
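As an illustration of the search-index use case, here is a minimal inverted index built with and without stopword filtering. The documents, the stopword set, and the `build_index` helper are hypothetical; a production search engine would also handle tokenization, stemming, and ranking.

```python
from collections import defaultdict

# Hypothetical stopword list and documents, for illustration only.
STOPS = {"the", "a", "of", "to", "and", "in", "is"}
docs = {
    1: "the history of the roman empire",
    2: "a guide to the python language",
    3: "the python is a snake",
}

def build_index(documents, stops=frozenset()):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in text.lower().split():
            if term not in stops:
                index[term].add(doc_id)
    return index

full = build_index(docs)
pruned = build_index(docs, STOPS)

print(len(full), len(pruned))    # 12 7
print(sorted(pruned["python"]))  # [2, 3]
```

The pruned index has fewer terms, and the entries it drops are the ones whose posting lists would otherwise cover almost every document.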

Complete Pipeline Example

import spacy
from collections import Counter

nlp = spacy.load("en_core_web_sm")

def extract_content_words(text):
    doc = nlp(text.lower())
    return [
        token.lemma_
        for token in doc
        if not token.is_stop
        and not token.is_punct
        and not token.is_space
        and token.is_alpha
    ]

documents = [
    "Natural language processing enables machines to understand human text.",
    "Deep learning models process text more effectively than rule-based systems.",
    "Transformers revolutionized natural language understanding tasks."
]

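term_counts = Counter()
for doc_text in documents:
    words = extract_content_words(doc_text)
    term_counts.update(words)  # accumulate lemma frequencies across the corpus
    print(words)

# Example output (exact lemmas depend on the spaCy model version):
# ['natural', 'language', 'processing', 'enable', 'machine', 'understand', 'human', 'text']
# ['deep', 'learn', 'model', 'process', 'text', 'effectively', 'rule', 'base', 'system']
# ['transformer', 'revolutionize', 'natural', 'language', 'understand', 'task']

print(term_counts.most_common(3))  # most frequent content lemmas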