Stopword Removal in NLP

Stopwords are high-frequency words that carry little semantic weight on their own — “the”, “is”, “and”, “at”, “which”. Removing them reduces noise in classic NLP pipelines. But the decision isn’t always straightforward.


What Are Stopwords?

Stopwords are words that appear so often across all texts that they don’t help distinguish one document from another. In a topic classification task, “the” appears in every article equally, contributing no signal.

import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
english_stops = stopwords.words('english')
print(english_stops[:20])
# ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you',
# "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself',
# 'yourselves', 'he', 'him', 'his']
print(f"Total English stopwords in NLTK: {len(english_stops)}") # e.g. 179; the exact count depends on the corpus version
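
The claim that such words show up in practically every document can be checked by counting document frequency. The following is a minimal sketch, assuming a made-up three-sentence corpus and naive whitespace tokenisation; the corpus and the `document_frequency` helper are illustrative only and are not part of NLTK.

```python
from collections import Counter

# Toy corpus, invented for illustration only.
corpus = [
    "the model reads the text",
    "the tokenizer splits the sentence",
    "the parser builds the tree",
]

def document_frequency(docs):
    """Count in how many documents each word occurs at least once."""
    df = Counter()
    for doc in docs:
        df.update(set(doc.split()))  # set() so repeats within a doc count once
    return df

df = document_frequency(corpus)
# Words present in every document cannot help tell the documents apart.
in_every_doc = sorted(word for word, count in df.items() if count == len(corpus))
print(in_every_doc)  # ['the']
```

On real text the threshold would usually be a high fraction of documents rather than all of them.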

Stopword Removal with NLTK

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
nltk.download('punkt_tab')  # tokenizer data used by word_tokenize
nltk.download('stopwords')  # stopword lists
# A set gives constant-time membership checks
stop_words = set(stopwords.words('english'))
text = "The latest AI models are transforming how we process and understand natural language."
tokens = word_tokenize(text.lower())
# Keep alphabetic tokens only (drops punctuation), then drop stopwords
filtered = [word for word in tokens if word.isalpha() and word not in stop_words]
print("Original tokens:", tokens)
print("After removal: ", filtered)
# After removal: ['latest', 'ai', 'models', 'transforming', 'process', 'understand', 'natural', 'language']

Stopword Removal with spaCy

spaCy marks stopwords via token.is_stop:

import spacy
nlp = spacy.load("en_core_web_sm")
text = "The researchers published a groundbreaking study on large language models."
doc = nlp(text)
content_words = [token.text for token in doc
                 if not token.is_stop and not token.is_punct]
print(content_words)
# ['researchers', 'published', 'groundbreaking', 'study', 'large', 'language', 'models']

Adding Custom Stopwords

Domain-specific text often contains words that are common but meaningless in context — journal names, legal boilerplate, company names appearing in every document:

import spacy
nlp = spacy.load("en_core_web_sm")
# Add custom stopwords for a legal document pipeline
custom_stops = {"hereby", "herein", "thereof", "aforementioned", "pursuant"}
for word in custom_stops:
    nlp.vocab[word].is_stop = True
# Or remove a default stopword that matters in your domain
nlp.vocab["not"].is_stop = False # "not" can be critical in sentiment analysis!
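
With a plain Python set, such as one built from NLTK's list, the same customisation is ordinary set arithmetic. The sketch below assumes a small hand-written base set standing in for a real list, and `customise_stopwords` is a hypothetical helper name, not a library function.

```python
def customise_stopwords(base, extra=(), keep=()):
    """Return a new stopword set: base plus `extra`, minus `keep`."""
    return (set(base) | set(extra)) - set(keep)

# Hand-written stand-in for a real list such as stopwords.words('english').
base_stops = {"the", "is", "and", "not", "of"}

legal_stops = customise_stopwords(
    base_stops,
    extra={"hereby", "herein", "thereof"},  # domain boilerplate to drop
    keep={"not"},                           # negation stays in the text
)

tokens = "the tenant is hereby not liable".split()
filtered_tokens = [t for t in tokens if t not in legal_stops]
print(filtered_tokens)  # ['tenant', 'not', 'liable']
```

Returning a new set leaves the base list untouched, so different pipelines can derive their own variants from it.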

Multilingual Stopwords

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
# Available languages depend on the installed corpus version;
# stopwords.fileids() lists them all
languages = ['english', 'french', 'german', 'spanish', 'portuguese',
             'italian', 'dutch', 'arabic', 'chinese']
for lang in languages:
    stops = stopwords.words(lang)
    print(f"{lang:12}: {len(stops)} stopwords")
# Spanish example
spanish_stops = set(stopwords.words('spanish'))
text_es = "Los modelos de lenguaje han transformado el procesamiento del texto."
tokens_es = word_tokenize(text_es.lower(), language='spanish')
filtered_es = [w for w in tokens_es if w.isalpha() and w not in spanish_stops]
print(filtered_es)
# ['modelos', 'lenguaje', 'transformado', 'procesamiento', 'texto']
# ("han" is dropped as well: NLTK's Spanish list includes forms of "haber")

When NOT to Remove Stopwords

This is where most tutorials go wrong. Stopword removal is not universally beneficial:

Sentiment analysis: “not good” becomes “good” once “not” is removed as a stopword, so the negation is lost and the polarity flips.

Question answering: “Who is the president of France?” is reduced to “president France”, which no longer says what is being asked.

Machine translation: grammatical function words are essential for correct syntax in the target language.

Transformer models: BERT, GPT, and similar models were pretrained on full text including stopwords. Removing them before passing text to these models strips context the models expect and typically degrades performance.

Named entity recognition: dropping “the” from “the United Nations” looks harmless, but removing tokens can shift the phrase boundaries that chunking relies on.
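
The sentiment case is easy to reproduce. The following is a minimal sketch, assuming a small hand-picked stopword set and a hand-picked set of protected negation words; both sets and the `remove_stopwords` helper are illustrative, not taken from any library.

```python
# Small hand-picked stopword set for illustration; real lists are much longer.
STOP_WORDS = {"the", "is", "was", "a", "not", "no"}
# Negation words we choose to protect (an assumption, extend as needed).
NEGATIONS = {"not", "no", "never"}

def remove_stopwords(tokens, stop_words, protect=frozenset()):
    """Drop stopwords, except any listed in `protect`."""
    return [t for t in tokens if t not in stop_words or t in protect]

review = "the film was not good".split()

naive = remove_stopwords(review, STOP_WORDS)
print(naive)
# ['film', 'good']  -> reads as positive

guarded = remove_stopwords(review, STOP_WORDS, protect=NEGATIONS)
print(guarded)
# ['film', 'not', 'good']  -> negation preserved
```

Protecting a short list of negations is one possible compromise when a count-based sentiment model still benefits from removing the other stopwords.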


Stopwords in Modern NLP Pipelines

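In 2025, stopword removal is primarily used in classic, count-based pipelines, such as the topic classification setting described earlier, where high-frequency function words add noise rather than signal.

If you’re using transformer-based models end-to-end, skip this step.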


Complete Pipeline Example

import spacy
nlp = spacy.load("en_core_web_sm")
def extract_content_words(text):
    """Return lemmas of the alphabetic, non-stopword tokens in text."""
    doc = nlp(text.lower())
    return [
        token.lemma_
        for token in doc
        if not token.is_stop
        and not token.is_punct
        and not token.is_space
        and token.is_alpha
    ]
documents = [
    "Natural language processing enables machines to understand human text.",
    "Deep learning models process text more effectively than rule-based systems.",
    "Transformers revolutionized natural language understanding tasks."
]
for doc_text in documents:
    words = extract_content_words(doc_text)
    print(words)
# Example output (exact lemmas can vary with the spaCy model version):
# ['natural', 'language', 'processing', 'enable', 'machine', 'understand', 'human', 'text']
# ['deep', 'learn', 'model', 'process', 'text', 'effectively', 'rule', 'base', 'system']
# ['transformer', 'revolutionize', 'natural', 'language', 'understand', 'task']
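
Once each document is reduced to content lemmas, a common next step is to count them across the corpus. The sketch below uses `collections.Counter` for that; the token lists are hard-coded from the example output above so that it runs without spaCy, and real output may differ by model version.

```python
from collections import Counter

# Content lemmas per document, copied from the example output above.
content_words = [
    ["natural", "language", "processing", "enable", "machine", "understand", "human", "text"],
    ["deep", "learn", "model", "process", "text", "effectively", "rule", "base", "system"],
    ["transformer", "revolutionize", "natural", "language", "understand", "task"],
]

# Flatten the per-document lists and count each lemma across the corpus.
counts = Counter(word for doc in content_words for word in doc)

# Lemmas that occur more than once hint at the corpus's shared topic.
repeated = sorted(word for word, n in counts.items() if n > 1)
print(repeated)  # ['language', 'natural', 'text', 'understand']
```

With the stopwords already filtered out, the most frequent entries are content terms rather than function words.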