Bag of Words (BoW) in NLP
Bag of Words is one of the simplest ways to convert text into numbers. It counts how many times each word appears in a document, throwing away word order, grammar, and context. Despite its simplicity, it’s still useful for many text classification tasks.
The Core Idea
BoW represents a document as a vector of word counts. Each unique word in the vocabulary gets a position, and the value at that position is how many times the word appears.
Corpus:

    Doc 1: "the cat sat on the mat"
    Doc 2: "the cat sat on the hat"
    Doc 3: "the cat in the hat"

Vocabulary: [cat, hat, in, mat, on, sat, the]

    Doc 1: [1, 0, 0, 1, 1, 1, 2]
    Doc 2: [1, 1, 0, 0, 1, 1, 2]
    Doc 3: [1, 1, 1, 0, 0, 0, 2]

Word order is ignored; that is the “bag” metaphor: the same bag results no matter how the words are arranged.
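To make the counting step concrete, here is a minimal from-scratch sketch that reproduces the vectors above. It assumes plain whitespace tokenization and lowercasing, which is simpler than what a real vectorizer does; the helper names `build_vocabulary` and `bow_vector` are made up for this illustration.

```python
from collections import Counter

def build_vocabulary(docs):
    """Return the sorted list of unique whitespace-separated tokens."""
    return sorted({token for doc in docs for token in doc.lower().split()})

def bow_vector(doc, vocabulary):
    """Count how often each vocabulary word occurs in one document."""
    counts = Counter(doc.lower().split())
    return [counts[word] for word in vocabulary]

toy_corpus = [
    "the cat sat on the mat",
    "the cat sat on the hat",
    "the cat in the hat",
]

vocabulary = build_vocabulary(toy_corpus)
vectors = [bow_vector(doc, vocabulary) for doc in toy_corpus]

print(vocabulary)  # ['cat', 'hat', 'in', 'mat', 'on', 'sat', 'the']
for vec in vectors:
    print(vec)
```

Sorting the vocabulary fixes the position of every word, so each document maps to a vector of the same length no matter which words it contains.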
Building BoW with scikit-learn
    from sklearn.feature_extraction.text import CountVectorizer
    import pandas as pd

    corpus = [
        "Natural language processing enables machines to understand human text.",
        "Deep learning models have transformed natural language processing.",
        "Text classification is a fundamental natural language processing task.",
        "Machines can now generate natural language with remarkable fluency.",
    ]

    vectorizer = CountVectorizer(stop_words='english', lowercase=True)
    X = vectorizer.fit_transform(corpus)

    vocab = vectorizer.get_feature_names_out()
    df = pd.DataFrame(X.toarray(), columns=vocab)
    print(df)

Controlling Vocabulary Size
For real datasets, uncontrolled vocabulary size leads to very sparse, high-dimensional matrices:
    vectorizer = CountVectorizer(
        stop_words='english',
        max_features=1000,   # keep the 1000 most frequent terms across the corpus
        min_df=2,            # ignore terms appearing in fewer than 2 documents
        max_df=0.95,         # ignore terms appearing in more than 95% of documents
        ngram_range=(1, 2),  # include unigrams and bigrams
    )

    X = vectorizer.fit_transform(corpus)
    print(f"Vocabulary size: {len(vectorizer.vocabulary_)}")
    print(f"Matrix shape: {X.shape}")

Text Classification with BoW
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import Pipeline
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import classification_report

    # Sample dataset
    texts = [
        "The GPU accelerated the model training significantly",
        "The recipe requires three cups of flour and two eggs",
        "Backpropagation updates neural network weights iteratively",
        "Season the pasta with olive oil and fresh basil",
        "Transformer attention mechanisms process tokens in parallel",
        "Bake the cake at 350 degrees for 30 minutes",
    ]
    labels = ["tech", "food", "tech", "food", "tech", "food"]

    X_train, X_test, y_train, y_test = train_test_split(
        texts, labels, test_size=0.3, random_state=42
    )

    pipeline = Pipeline([
        ('vectorizer', CountVectorizer(stop_words='english')),
        ('classifier', MultinomialNB()),
    ])

    pipeline.fit(X_train, y_train)
    preds = pipeline.predict(X_test)
    print(classification_report(y_test, preds))

Binary BoW
Instead of raw counts, use presence/absence (0 or 1):
    vectorizer_binary = CountVectorizer(binary=True, stop_words='english')
    X_binary = vectorizer_binary.fit_transform(corpus)

Binary BoW works better when the number of occurrences is less informative than simple presence, which is common in short documents like tweets or headlines.
BoW with N-grams
Adding bigrams and trigrams captures some context that single words miss:
    vectorizer_ngram = CountVectorizer(
        ngram_range=(1, 3),  # unigrams, bigrams, trigrams
        stop_words='english',
        max_features=5000,
    )

    X_ngram = vectorizer_ngram.fit_transform(corpus)
    print(f"N-gram vocabulary size: {X_ngram.shape[1]}")

    # Inspect some bigrams: keep features made of exactly two tokens
    # (checking only for a space would also match trigrams)
    features = vectorizer_ngram.get_feature_names_out()
    bigrams = [f for f in features if len(f.split()) == 2]
    print("Sample bigrams:", bigrams[:10])

Limitations of Bag of Words
- No word order: “dog bites man” and “man bites dog” produce the same vector.
- No semantics: “happy”, “joyful”, and “glad” are treated as completely different, unrelated words.
- High dimensionality: vocabularies easily reach 50,000+ words, creating massive sparse matrices.
- Rare word problem: words that appear only once or twice give a model too few examples to learn from, so they mostly add noise.
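The first two limitations can be checked in a few lines. This is a small sketch using Python's `Counter` as a stand-in for a unigram BoW vector, with whitespace tokenization assumed; the helper name `bag` is made up for the example.

```python
from collections import Counter

def bag(text):
    """Unigram bag of words: token counts with order discarded."""
    return Counter(text.lower().split())

# Same words in a different order give identical bags
same_bag = bag("dog bites man") == bag("man bites dog")
print(same_bag)  # True

# Synonyms occupy unrelated vocabulary positions, so their bags share nothing
shared = bag("happy") & bag("joyful")
print(sum(shared.values()))  # 0
```

Any similarity measure computed on these vectors inherits both properties: the two "dog" sentences look identical, and the two synonyms look completely unrelated.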
BoW vs Modern Alternatives
| Method | Word order | Semantics | Dimensionality | Speed |
|---|---|---|---|---|
| BoW | No | No | High (sparse) | Fast |
| TF-IDF | No | Partial | High (sparse) | Fast |
| Word2Vec | No | Yes | Low (dense) | Medium |
| BERT embeddings | Yes | Yes | Low (dense) | Slow |
In 2025, Bag of Words is still widely used for:
- Baseline models: always build a BoW baseline before trying complex models
- Keyword-based search: BM25 in Elasticsearch is essentially a weighted BoW
- Topic modeling: LDA operates on document-term matrices
- Fast inference at scale: no GPU required, and sparse count vectors are cheap to build and score on CPUs, so throughput can be very high (the exact rate depends on document length and hardware)
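To illustrate the keyword-search point, here is a rough sketch of BM25 scoring built directly on word counts. It follows the commonly cited Okapi BM25 formula with a non-negative IDF variant. The function name `bm25_scores`, the whitespace tokenization, and the parameter values `k1=1.2` and `b=0.75` are assumptions chosen for the example, not a description of how any particular search engine implements it.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score every document against the query using BM25 over word counts."""
    tokenized = [doc.lower().split() for doc in docs]
    bags = [Counter(tokens) for tokens in tokenized]
    n_docs = len(docs)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs

    scores = []
    for tokens, doc_bag in zip(tokenized, bags):
        score = 0.0
        for term in query.lower().split():
            # Document frequency: number of documents containing the term
            df = sum(1 for other in bags if term in other)
            if df == 0:
                continue  # term unseen in the corpus contributes nothing
            idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
            tf = doc_bag[term]
            # Length normalization: long documents need more hits to score well
            norm = k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf * (k1 + 1) / (tf + norm)
        scores.append(score)
    return scores

docs = [
    "the cat sat on the mat",
    "the cat sat on the hat",
    "the cat in the hat",
]
scores = bm25_scores("mat", docs)
print(scores)
```

Only the first document contains "mat", so it is the only one with a non-zero score. The inputs are just term counts and document lengths, which is why BM25 can be read as a weighted bag of words.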