Bag of Words (BoW) in NLP

Bag of Words is one of the simplest ways to convert text into numbers. It counts how many times each word appears in a document, throwing away word order, grammar, and context. Despite its simplicity, it’s still useful for many text classification tasks.

The Core Idea

BoW represents a document as a vector of word counts. Each unique word in the vocabulary gets a position, and the value at that position is how many times the word appears.

Corpus:
  Doc 1: "the cat sat on the mat"
  Doc 2: "the cat sat on the hat"
  Doc 3: "the cat in the hat"

Vocabulary: [cat, hat, in, mat, on, sat, the]

Doc 1: [1, 0, 0, 1, 1, 1, 2]
Doc 2: [1, 1, 0, 0, 1, 1, 2]
Doc 3: [1, 1, 1, 0, 0, 0, 2]

Word order is ignored, which is where the “bag” metaphor comes from: shuffle the words in a document and its vector stays the same.
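The vectors above can be reproduced in a few lines of plain Python. This is a minimal sketch, assuming whitespace tokenization and an alphabetically sorted vocabulary; the helper name `bag_of_words` is made up for illustration and does not come from any library.

```python
from collections import Counter

corpus = [
    "the cat sat on the mat",
    "the cat sat on the hat",
    "the cat in the hat",
]

def bag_of_words(docs):
    """Return (vocabulary, count vectors) for whitespace-tokenized docs."""
    tokenized = [doc.split() for doc in docs]
    # Sorted list of every distinct token across the corpus.
    vocabulary = sorted({tok for tokens in tokenized for tok in tokens})
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # A Counter returns 0 for tokens the document does not contain.
        vectors.append([counts[word] for word in vocabulary])
    return vocabulary, vectors

vocabulary, vectors = bag_of_words(corpus)
print(vocabulary)
for i, vec in enumerate(vectors, start=1):
    print(f"Doc {i}: {vec}")
```

Each document maps to a vector of the same length, one slot per vocabulary word, which is what lets downstream models treat text as ordinary numeric features.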

Building BoW with scikit-learn

from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

corpus = [
    "Natural language processing enables machines to understand human text.",
    "Deep learning models have transformed natural language processing.",
    "Text classification is a fundamental natural language processing task.",
    "Machines can now generate natural language with remarkable fluency."
]

vectorizer = CountVectorizer(stop_words='english', lowercase=True)
X = vectorizer.fit_transform(corpus)

vocab = vectorizer.get_feature_names_out()
df = pd.DataFrame(X.toarray(), columns=vocab)
print(df)

Controlling Vocabulary Size

For real datasets, an uncontrolled vocabulary produces very sparse, high-dimensional matrices. The parameters below cap and prune it:

vectorizer = CountVectorizer(
    stop_words='english',
    max_features=1000,    # keep the 1000 most frequent terms across the corpus
    min_df=2,             # ignore terms appearing in fewer than 2 documents
    max_df=0.95,          # ignore terms appearing in more than 95% of documents
    ngram_range=(1, 2)    # include unigrams and bigrams
)

X = vectorizer.fit_transform(corpus)
print(f"Vocabulary size: {len(vectorizer.vocabulary_)}")
print(f"Matrix shape: {X.shape}")
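To make the effect of `min_df` and `max_df` concrete, here is a rough pure-Python sketch of document-frequency pruning. It assumes an integer `min_df` is an absolute document count and a float `max_df` is a fraction of the corpus, which matches how these parameters are documented for `CountVectorizer`. The function `prune_vocabulary` is a hypothetical helper and not the library's implementation.

```python
from collections import Counter

docs = [
    "the cat sat on the mat",
    "the cat sat on the hat",
    "the cat in the hat",
]

def prune_vocabulary(docs, min_df=1, max_df=1.0):
    """Keep terms whose document frequency lies between min_df and max_df.

    min_df is an absolute document count; max_df is a fraction of the corpus.
    """
    n_docs = len(docs)
    # Document frequency: number of documents containing the term at least once.
    doc_freq = Counter(term for doc in docs for term in set(doc.split()))
    max_count = max_df * n_docs
    kept = sorted(t for t, df in doc_freq.items() if min_df <= df <= max_count)
    return kept, doc_freq

kept, doc_freq = prune_vocabulary(docs, min_df=2, max_df=0.95)
print(dict(doc_freq))
print(kept)
```

Here "the" and "cat" are dropped for appearing in every document, and "mat" and "in" for appearing in only one. On a very small corpus these thresholds prune aggressively, so expect only a handful of terms to survive.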

Text Classification with BoW

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# Sample dataset
texts = [
    "The GPU accelerated the model training significantly",
    "The recipe requires three cups of flour and two eggs",
    "Backpropagation updates neural network weights iteratively",
    "Season the pasta with olive oil and fresh basil",
    "Transformer attention mechanisms process tokens in parallel",
    "Bake the cake at 350 degrees for 30 minutes"
]
labels = ["tech", "food", "tech", "food", "tech", "food"]

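# Toy data: with only six samples the reported scores are illustrative, not meaningful.
# stratify keeps both classes represented in the train and test splits.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.3, random_state=42, stratify=labels
)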

pipeline = Pipeline([
    ('vectorizer', CountVectorizer(stop_words='english')),
    ('classifier', MultinomialNB())
])

pipeline.fit(X_train, y_train)
preds = pipeline.predict(X_test)
print(classification_report(y_test, preds))
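To show what a multinomial Naive Bayes classifier does with BoW counts, the sketch below implements the core idea from scratch. It is a simplified illustration, assuming whitespace tokens and add-one (Laplace) smoothing with `alpha=1.0`. The training sentences and the names `fit_multinomial_nb` and `predict` are invented for this example.

```python
import math
from collections import Counter, defaultdict

train = [
    ("gpu model training", "tech"),
    ("neural network weights", "tech"),
    ("flour eggs recipe", "food"),
    ("pasta olive oil basil", "food"),
]

def fit_multinomial_nb(samples, alpha=1.0):
    """Estimate log priors and smoothed log likelihoods from (text, label) pairs."""
    class_docs = Counter(label for _, label in samples)
    word_counts = defaultdict(Counter)
    for text, label in samples:
        word_counts[label].update(text.split())
    vocab = {w for counts in word_counts.values() for w in counts}
    log_prior = {c: math.log(n / len(samples)) for c, n in class_docs.items()}
    log_like = {}
    for c, counts in word_counts.items():
        # Smoothing gives every vocabulary word a nonzero probability per class.
        total = sum(counts.values()) + alpha * len(vocab)
        log_like[c] = {w: math.log((counts[w] + alpha) / total) for w in vocab}
    return log_prior, log_like, vocab

def predict(text, log_prior, log_like, vocab):
    """Return the class with the highest log posterior; unseen words are skipped."""
    counts = Counter(w for w in text.split() if w in vocab)
    scores = {
        c: log_prior[c] + sum(n * log_like[c][w] for w, n in counts.items())
        for c in log_prior
    }
    return max(scores, key=scores.get)

model = fit_multinomial_nb(train)
print(predict("training neural network", *model))
print(predict("olive oil recipe", *model))
```

The classifier only ever sees word counts, so its decision comes down to which class makes the observed words most probable.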

Binary BoW

Instead of raw counts, use presence/absence (0 or 1):

vectorizer_binary = CountVectorizer(binary=True, stop_words='english')
X_binary = vectorizer_binary.fit_transform(corpus)

Binary BoW tends to work better when the number of occurrences is less informative than simple presence, which is common in short documents like tweets or headlines.
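The difference between the two encodings is easiest to see on a document with a heavily repeated word. This is a small sketch with a made-up sentence and a hand-picked vocabulary, using plain Python rather than `CountVectorizer`.

```python
from collections import Counter

# Hypothetical document and fixed vocabulary, chosen for illustration.
doc = "free free free tickets win free prize"
vocabulary = ["free", "meeting", "prize", "tickets", "win"]

counts = Counter(doc.split())
count_vector = [counts[w] for w in vocabulary]
# Binary BoW clips every nonzero count to 1.
binary_vector = [1 if counts[w] > 0 else 0 for w in vocabulary]

print(count_vector)
print(binary_vector)
```

In the count vector the repeated word dominates, while the binary vector gives every present word equal weight.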

BoW with N-grams

Adding bigrams and trigrams captures some context that single words miss:

vectorizer_ngram = CountVectorizer(
    ngram_range=(1, 3),   # unigrams, bigrams, trigrams
    stop_words='english',
    max_features=5000
)

X_ngram = vectorizer_ngram.fit_transform(corpus)
print(f"N-gram vocabulary size: {X_ngram.shape[1]}")

# Inspect some bigrams
features = vectorizer_ngram.get_feature_names_out()
bigrams = [f for f in features if ' ' in f]
print("Sample bigrams:", bigrams[:10])
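For intuition about what `ngram_range` adds to the vocabulary, the following sketch generates word n-grams by sliding a window over a token list. The function `extract_ngrams` is a hypothetical helper written for this example, not a scikit-learn function.

```python
def extract_ngrams(tokens, n_min=1, n_max=2):
    """Return all contiguous n-grams with n_min <= n <= n_max, joined by spaces."""
    grams = []
    for n in range(n_min, n_max + 1):
        # A window of size n fits at len(tokens) - n + 1 starting positions.
        for start in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[start:start + n]))
    return grams

tokens = "natural language processing task".split()
ngrams = extract_ngrams(tokens, 1, 3)
print(ngrams)
```

Four tokens already yield nine features here, which is why n-gram vocabularies grow quickly and are usually paired with `max_features` or `min_df`. Also note that, as far as I understand `CountVectorizer`, stop words are removed before n-grams are built, so a bigram can join two words that were not adjacent in the original text.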

Limitations of Bag of Words

No word order: “dog bites man” and “man bites dog” produce the same unigram vector.

No semantics: “happy”, “joyful”, and “glad” are treated as completely unrelated features.

High dimensionality: vocabularies easily reach 50,000+ words, creating massive sparse matrices.

Rare word problem: words appearing only once or twice carry little signal and mostly add noise and extra dimensions, which is why thresholds such as min_df are common.
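The first two limitations can be checked directly. The sketch below assumes whitespace tokenization and stores each bag as a `Counter`; the example sentences and the helpers `bow` and `cosine` are invented for illustration.

```python
import math
from collections import Counter

def bow(text):
    """Whitespace-tokenized bag of words stored as a Counter."""
    return Counter(text.split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors stored as Counters."""
    dot = sum(a[w] * b[w] for w in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)

# Different meaning, identical bag.
same_bag = bow("dog bites man") == bow("man bites dog")

# Similar meaning, but no shared words, so the similarity is zero.
synonym_sim = cosine(
    bow("the film made me happy"),
    bow("that movie left us glad"),
)

print(same_bag)
print(synonym_sim)
```

Two sentences with opposite meanings are indistinguishable, while two near-paraphrases look completely unrelated. Closing this gap is the main motivation for dense embeddings.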

BoW vs Modern Alternatives

Method	Order	Semantics	Dimension	Speed
BoW	No	No	High (sparse)	Fast
TF-IDF	No	No (term weighting only)	High (sparse)	Fast
Word2Vec	No	Yes	Low (dense)	Medium
BERT embeddings	Yes	Yes	Low (dense)	Slow

In 2025, Bag of Words is still widely used for:

Baseline models: build a BoW baseline before trying more complex models
Keyword-based search: BM25, the default ranking function in Elasticsearch, is essentially a weighted BoW
Topic modeling: LDA operates on document-term matrices
Fast inference at scale: no GPU required, and sparse BoW features are cheap to compute and score on CPU