Bag of Words (BoW) in NLP

Bag of Words is one of the simplest ways to convert text into numbers. It counts how many times each word appears in a document, throwing away word order, grammar, and context. Despite its simplicity, it’s still useful for many text classification tasks.


The Core Idea

BoW represents a document as a vector of word counts. Each unique word in the vocabulary gets a position, and the value at that position is how many times the word appears.

Corpus:
Doc 1: "the cat sat on the mat"
Doc 2: "the cat sat on the hat"
Doc 3: "the cat in the hat"

Vocabulary: [cat, hat, in, mat, on, sat, the]

Vectors:
Doc 1: [1, 0, 0, 1, 1, 1, 2]
Doc 2: [1, 1, 0, 0, 1, 1, 2]
Doc 3: [1, 1, 1, 0, 0, 0, 2]

Word order is ignored, which is where the “bag” metaphor comes from: shuffle the words of a document and you still get the same bag, and therefore the same vector.
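The counting described above can be reproduced in a few lines of plain Python. This is a minimal sketch, assuming whitespace tokenisation and an alphabetically sorted vocabulary; the helper name `bag_of_words` is illustrative and not part of any library.

```python
from collections import Counter

corpus = [
    "the cat sat on the mat",
    "the cat sat on the hat",
    "the cat in the hat",
]

# Assumption: tokens are separated by single spaces and already lowercase.
tokenised = [doc.split() for doc in corpus]

# The vocabulary is every unique token, sorted so each word has a fixed position.
vocabulary = sorted({token for doc in tokenised for token in doc})


def bag_of_words(tokens, vocabulary):
    """Return the count of each vocabulary word in `tokens` (hypothetical helper)."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]


vectors = [bag_of_words(doc, vocabulary) for doc in tokenised]

print(vocabulary)
for vector in vectors:
    print(vector)
```

Because `Counter` only tracks how often each token occurs, reordering the words of any document leaves its vector unchanged.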


Building BoW with scikit-learn

from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

corpus = [
    "Natural language processing enables machines to understand human text.",
    "Deep learning models have transformed natural language processing.",
    "Text classification is a fundamental natural language processing task.",
    "Machines can now generate natural language with remarkable fluency."
]

# Lowercase the text, drop English stop words, and count the remaining words
vectorizer = CountVectorizer(stop_words='english', lowercase=True)
X = vectorizer.fit_transform(corpus)  # sparse document-term matrix

# One row per document, one column per vocabulary word
vocab = vectorizer.get_feature_names_out()
df = pd.DataFrame(X.toarray(), columns=vocab)
print(df)

Controlling Vocabulary Size

For real datasets, uncontrolled vocabulary size leads to very sparse, high-dimensional matrices:

# Reuses the corpus defined in the previous example
vectorizer = CountVectorizer(
    stop_words='english',
    max_features=1000,   # keep the 1000 most frequent terms across the corpus
    min_df=2,            # ignore terms appearing in fewer than 2 documents
    max_df=0.95,         # ignore terms appearing in more than 95% of documents
    ngram_range=(1, 2)   # include unigrams and bigrams
)
X = vectorizer.fit_transform(corpus)
print(f"Vocabulary size: {len(vectorizer.vocabulary_)}")
print(f"Matrix shape: {X.shape}")
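To make the effect of `min_df` and `max_df` concrete, the sketch below computes document frequencies by hand and applies the same kind of thresholds. It is a simplified illustration in plain Python, not scikit-learn's implementation; the function `filter_by_document_frequency` and the toy documents are assumptions made for the example.

```python
from collections import Counter


def filter_by_document_frequency(tokenised_docs, min_df=2, max_df=0.95):
    """Keep terms whose document frequency lies within [min_df, max_df * n_docs].

    Hypothetical helper: min_df is an absolute document count and max_df is a
    proportion of the corpus, mirroring how the two arguments are used above.
    """
    n_docs = len(tokenised_docs)
    # Count each term once per document, however often it repeats inside it.
    doc_freq = Counter(term for doc in tokenised_docs for term in set(doc))
    max_count = max_df * n_docs
    return sorted(
        term for term, df in doc_freq.items() if min_df <= df <= max_count
    )


docs = [
    "natural language processing enables machines".split(),
    "deep learning transformed natural language processing".split(),
    "text classification is a natural language task".split(),
    "machines generate natural language".split(),
]

kept = filter_by_document_frequency(docs)
print(kept)
```

In this toy corpus, "natural" and "language" occur in every document and so exceed the 95% ceiling, while words found in only one document fall below the floor of two. Only "machines" and "processing" survive.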

Text Classification with BoW

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# Sample dataset (far too small for meaningful metrics; it only shows the workflow)
texts = [
    "The GPU accelerated the model training significantly",
    "The recipe requires three cups of flour and two eggs",
    "Backpropagation updates neural network weights iteratively",
    "Season the pasta with olive oil and fresh basil",
    "Transformer attention mechanisms process tokens in parallel",
    "Bake the cake at 350 degrees for 30 minutes"
]
labels = ["tech", "food", "tech", "food", "tech", "food"]

# stratify keeps both classes present in the tiny train and test splits
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.3, random_state=42, stratify=labels
)

# Vectorise and classify in one pipeline so the vocabulary is learned from training data only
pipeline = Pipeline([
    ('vectorizer', CountVectorizer(stop_words='english')),
    ('classifier', MultinomialNB())
])
pipeline.fit(X_train, y_train)
preds = pipeline.predict(X_test)
print(classification_report(y_test, preds))
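For intuition about what the classifier does with the count vectors, here is a rough sketch of multinomial Naive Bayes with add-one smoothing in plain Python. It is a simplified stand-in for `MultinomialNB`, not its actual implementation; the functions `tokenise`, `train_nb`, and `predict_nb` and the tiny training set are all assumptions for illustration.

```python
import math
from collections import Counter, defaultdict


def tokenise(text):
    # Assumption: lowercase and split on whitespace; no stop-word removal.
    return text.lower().split()


def train_nb(texts, labels):
    """Collect per-class word counts and class frequencies (hypothetical helper)."""
    word_counts = defaultdict(Counter)
    class_counts = Counter(labels)
    for text, label in zip(texts, labels):
        word_counts[label].update(tokenise(text))
    vocabulary = {w for counts in word_counts.values() for w in counts}
    return word_counts, class_counts, vocabulary


def predict_nb(text, word_counts, class_counts, vocabulary):
    """Return the class with the highest log-probability under add-one smoothing."""
    total_docs = sum(class_counts.values())
    scores = {}
    for label in class_counts:
        # Log prior: how common the class is in the training data.
        score = math.log(class_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in tokenise(text):
            if word not in vocabulary:
                continue  # words never seen in training carry no evidence
            # Smoothed log likelihood of the word given the class.
            score += math.log(
                (word_counts[label][word] + 1) / (total_words + len(vocabulary))
            )
        scores[label] = score
    return max(scores, key=scores.get)


train_texts = [
    "gpu training model",
    "neural network weights",
    "pasta olive oil",
    "bake cake flour",
]
train_labels = ["tech", "tech", "food", "food"]

model = train_nb(train_texts, train_labels)
print(predict_nb("train the neural model", *model))
print(predict_nb("olive oil cake", *model))
```

The first query is labelled tech and the second food, purely because of which class has seen those words more often. Word order plays no role in the score.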

Binary BoW

Instead of raw counts, use presence/absence (0 or 1):

vectorizer_binary = CountVectorizer(binary=True, stop_words='english')
X_binary = vectorizer_binary.fit_transform(corpus)

Binary BoW works better when the number of occurrences is less informative than simple presence — common in short documents like tweets or headlines.
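The difference between the two representations is easy to see on a pair of short texts. The sketch below uses plain Python rather than `CountVectorizer`; the example sentences and variable names are illustrative assumptions.

```python
from collections import Counter

docs = ["great great great phone", "great battery"]
tokenised = [doc.split() for doc in docs]

# Shared vocabulary across both documents, in a fixed sorted order.
vocabulary = sorted({token for doc in tokenised for token in doc})

count_vectors = []
binary_vectors = []
for tokens in tokenised:
    counts = Counter(tokens)
    # Count BoW: how many times each vocabulary word occurs.
    count_vectors.append([counts[word] for word in vocabulary])
    # Binary BoW: 1 if the word occurs at all, 0 otherwise.
    binary_vectors.append([1 if counts[word] > 0 else 0 for word in vocabulary])

print(vocabulary)
print(count_vectors)
print(binary_vectors)
```

In the count vectors the repeated "great" dominates the first document; in the binary vectors it weighs the same as any other word that is present.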


BoW with N-grams

Adding bigrams and trigrams captures some context that single words miss:

vectorizer_ngram = CountVectorizer(
    ngram_range=(1, 3),   # unigrams, bigrams, trigrams
    stop_words='english',
    max_features=5000
)
X_ngram = vectorizer_ngram.fit_transform(corpus)
print(f"N-gram vocabulary size: {X_ngram.shape[1]}")

# Inspect some bigrams (exactly two tokens, so trigrams are excluded)
features = vectorizer_ngram.get_feature_names_out()
bigrams = [f for f in features if len(f.split()) == 2]
print("Sample bigrams:", bigrams[:10])
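As a rough picture of what an n-gram is, the sketch below slides a window of n tokens across a sentence. The helper `ngrams` is a hypothetical function written for this example, assuming whitespace tokenisation; it is not how scikit-learn builds its features internally.

```python
def ngrams(tokens, n):
    """Return all contiguous runs of n tokens, joined by spaces (hypothetical helper)."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = "natural language processing task".split()

unigrams = ngrams(tokens, 1)
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)

print(unigrams)
print(bigrams)
print(trigrams)
```

A document with k tokens yields k - n + 1 n-grams of size n, and most of them are distinct, which is why the vocabulary grows quickly once bigrams and trigrams are included.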

Limitations of Bag of Words

No word order: “dog bites man” and “man bites dog” produce the same vector.

No semantics: “happy”, “joyful”, and “glad” are treated as unrelated words, each with its own dimension.

High dimensionality: vocabularies easily reach 50,000+ words, creating massive sparse matrices.

Rare word problem: words that appear only once or twice in a corpus usually add noise rather than signal, because a model cannot learn a reliable pattern from so few occurrences.
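The first two limitations can be checked directly. The sketch below stores each bag as a `Counter` and compares them; the helper `cosine_similarity` and the example sentences for the synonym case are assumptions made for illustration.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors stored as Counters."""
    dot = sum(a[word] * b[word] for word in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)


# No word order: both sentences contain exactly the same words.
order_a = Counter("dog bites man".split())
order_b = Counter("man bites dog".split())
print(order_a == order_b)

# No semantics: synonyms occupy different dimensions, so the bags share nothing.
happy = Counter("i feel happy".split())
joyful = Counter("we are joyful".split())
print(cosine_similarity(happy, joyful))
```

The two "dog" sentences are indistinguishable, while the two sentences with similar sentiment have zero overlap because no word appears in both.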


BoW vs Modern Alternatives

Method           | Order | Semantics | Dimension     | Speed
BoW              | No    | No        | High (sparse) | Fast
TF-IDF           | No    | Partial   | High (sparse) | Fast
Word2Vec         | No    | Yes       | Low (dense)   | Medium
BERT embeddings  | Yes   | Yes       | Low (dense)   | Slow

In 2025, Bag of Words is still widely used for: