N-grams in NLP

An n-gram is a contiguous sequence of n items from a piece of text. N-grams capture local context that a single-word (unigram) model misses — “New York” is more meaningful as a unit than “New” and “York” separately.

Types of N-grams

Name	n	Example from “the quick brown fox”
Unigram	1	“the”, “quick”, “brown”, “fox”
Bigram	2	“the quick”, “quick brown”, “brown fox”
Trigram	3	“the quick brown”, “quick brown fox”
4-gram	4	“the quick brown fox”
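
The rows of the table can be reproduced without any library. The following is a minimal sketch, assuming whitespace tokenization; `word_ngrams` is a hypothetical helper name, not part of any package.

```python
def word_ngrams(tokens, n):
    """Return every contiguous window of n tokens as a tuple."""
    # zip over n staggered views of the list; zip stops at the shortest view,
    # so only complete windows are produced
    return list(zip(*(tokens[i:] for i in range(n))))

tokens = "the quick brown fox".split()
for n in range(1, 5):
    print(n, word_ngrams(tokens, n))
```

A sentence with k tokens yields k - n + 1 n-grams, which is why the 4-gram row has a single entry.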

Character n-grams split on characters rather than words:

“quick” → char bigrams: “qu”, “ui”, “ic”, “ck”
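
The same sliding window works on characters. This is a small sketch under the assumption that the word is used as-is, with no boundary padding; `char_ngrams` is a hypothetical helper name.

```python
def char_ngrams(word, n):
    """Return every contiguous run of n characters in the word."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]

print(char_ngrams("quick", 2))  # ['qu', 'ui', 'ic', 'ck']
print(char_ngrams("quick", 3))  # ['qui', 'uic', 'ick']
```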

Generating N-grams with NLTK

from nltk.util import ngrams
from nltk.tokenize import word_tokenize
import nltk
nltk.download('punkt_tab')

text = "Large language models transform natural language processing in remarkable ways."
tokens = word_tokenize(text.lower())

# Bigrams
bigrams = list(ngrams(tokens, 2))
print("Bigrams:", bigrams[:5])
# [('large', 'language'), ('language', 'models'), ('models', 'transform'),
#  ('transform', 'natural'), ('natural', 'language')]

# Trigrams
trigrams = list(ngrams(tokens, 3))
print("Trigrams:", trigrams[:4])
# [('large', 'language', 'models'), ('language', 'models', 'transform'), ...]

# Frequency distribution
from nltk import FreqDist
bigram_freq = FreqDist(bigrams)
print("Most common bigrams:", bigram_freq.most_common(5))
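
In the single sentence above every bigram occurs once, so the ranking is not very informative. The sketch below uses a made-up sentence with a repeated bigram and the standard library's `collections.Counter` (which NLTK's `FreqDist` subclasses) to show what the frequency count looks like when something actually repeats.

```python
from collections import Counter

# Illustrative sentence, chosen so that one bigram repeats
tokens = "the cat sat on the mat and the cat slept".split()

# Pair each token with its successor to form bigrams
bigrams = list(zip(tokens, tokens[1:]))
counts = Counter(bigrams)

print(counts.most_common(1))  # [(('the', 'cat'), 2)]
```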

N-grams with scikit-learn

from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "machine learning models process text data efficiently",
    "deep learning transforms natural language processing tasks",
    "language models generate fluent and coherent text",
    "neural networks learn representations from large datasets"
]

# Bigrams only
bigram_vectorizer = CountVectorizer(ngram_range=(2, 2), stop_words='english')
X_bigrams = bigram_vectorizer.fit_transform(corpus)

print("Bigram features:")
print(bigram_vectorizer.get_feature_names_out())

# Mixed: unigrams + bigrams + trigrams
mixed_vectorizer = CountVectorizer(ngram_range=(1, 3), stop_words='english', max_features=50)
X_mixed = mixed_vectorizer.fit_transform(corpus)
print(f"\nMixed n-gram features: {X_mixed.shape[1]}")
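
To make the meaning of `ngram_range` concrete, here is a rough pure-Python sketch of counting every n-gram between a minimum and maximum length in one document. It is an approximation only: it assumes whitespace tokenization and skips the stop-word filtering, token pattern, and cross-document vocabulary that `CountVectorizer` handles. `ngram_features` is a hypothetical helper name.

```python
from collections import Counter

def ngram_features(text, min_n, max_n):
    """Count every word n-gram with min_n <= n <= max_n in one document."""
    tokens = text.lower().split()
    counts = Counter()
    for n in range(min_n, max_n + 1):
        for i in range(len(tokens) - n + 1):
            # Join tokens with a space, the same way feature names are displayed
            counts[" ".join(tokens[i:i + n])] += 1
    return counts

features = ngram_features("language models generate fluent text", 1, 2)
print(sorted(features))
```

With five tokens and a range of (1, 2), the document contributes five unigrams and four bigrams.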

Character N-grams

Character n-grams are especially powerful for:

Misspelling and typo tolerance
Subword morphology (“unhappy”, “happiness”, and “happily” all share “happ”)
Language identification
Handling out-of-vocabulary words

from sklearn.feature_extraction.text import CountVectorizer

char_vectorizer = CountVectorizer(
    analyzer='char_wb',  # char n-grams, respecting word boundaries
    ngram_range=(3, 5),
    max_features=200
)

texts = ["tokenization", "tokenizing", "tokenizer", "untokenized"]
X = char_vectorizer.fit_transform(texts)

features = char_vectorizer.get_feature_names_out()
print("Character n-gram features:", features[:15])
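
The typo tolerance mentioned above can be illustrated by comparing sets of character trigrams. This sketch pads each word with one space on each side, loosely mirroring the boundary padding that `char_wb` applies, and scores overlap with Jaccard similarity. The helper names and the misspelled example are assumptions made for illustration.

```python
def char_ngram_set(word, n=3):
    """Set of character n-grams of a word padded with one space per side."""
    padded = f" {word} "
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}

def jaccard(a, b, n=3):
    """Jaccard overlap of the two words' character n-gram sets."""
    sa, sb = char_ngram_set(a, n), char_ngram_set(b, n)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0

print(jaccard("tokenization", "tokenizaton"))  # misspelling: most trigrams shared
print(jaccard("tokenization", "embedding"))    # unrelated word: no shared trigrams
```

A word-level model would treat the misspelling as a completely different token, while the trigram sets still overlap heavily.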

N-gram Language Model

A language model assigns probabilities to sequences of words. An n-gram language model estimates the probability of the next word based on the previous n-1 words:

import nltk
from nltk.lm import MLE
from nltk.lm.preprocessing import padded_everygram_pipeline
from nltk.tokenize import word_tokenize, sent_tokenize

nltk.download('punkt_tab')  # tokenizer data needed by word_tokenize and sent_tokenize

text = """
Natural language processing is a field of AI. Language models learn patterns
from text data. Modern NLP systems use deep learning. Transformers have
revolutionized language understanding and generation.
"""

sentences = [word_tokenize(s.lower()) for s in sent_tokenize(text)]
n = 3  # trigram model

train_data, vocab = padded_everygram_pipeline(n, sentences)

# Train an MLE trigram model
model = MLE(n)
model.fit(train_data, vocab)

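# Score a word given its two-word context
# ("processing" is the only word that follows "natural language" in the training text)
print(model.score("processing", ["natural", "language"]))  # P("processing" | "natural language")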

# Generate text
generated = list(model.generate(10, random_seed=42))
print("Generated:", ' '.join(generated))
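
The maximum likelihood estimate behind a trigram model is a ratio of counts: how often the context was followed by the word, divided by how often the context occurred. The sketch below works this out by hand on a tiny made-up corpus. The padding scheme (two start symbols, one end symbol) is a simplification and not exactly what `padded_everygram_pipeline` does; `mle_prob` is a hypothetical helper name.

```python
from collections import Counter

# Tiny illustrative corpus, already tokenized
sentences = [
    ["natural", "language", "processing", "is", "fun"],
    ["natural", "language", "models", "learn", "patterns"],
    ["natural", "language", "processing", "needs", "data"],
]

trigram_counts = Counter()
context_counts = Counter()
for sent in sentences:
    padded = ["<s>", "<s>"] + sent + ["</s>"]
    for i in range(len(padded) - 2):
        context = (padded[i], padded[i + 1])
        trigram_counts[context + (padded[i + 2],)] += 1
        context_counts[context] += 1

def mle_prob(word, context):
    """P(word | context) = count(context, word) / count(context)."""
    context = tuple(context)
    total = context_counts[context]
    return trigram_counts[context + (word,)] / total if total else 0.0

print(mle_prob("processing", ["natural", "language"]))  # 2 of 3 occurrences
print(mle_prob("models", ["natural", "language"]))      # 1 of 3 occurrences
print(mle_prob("fun", ["natural", "language"]))         # never seen: 0.0
```

The zero for an unseen continuation is the weakness that smoothing addresses.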

Practical Applications

Sentiment analysis features: bigrams like “not good” or “very fast” carry sentiment that the individual words miss (“not good” reverses the polarity of “good”).

Spam detection: character n-grams help catch obfuscations such as “V1agra” and “Vi@gra”, because each variant still shares short substrings like “gra” with the original word.

Autocomplete: suggest the next word based on the most probable n-gram completion.

Plagiarism detection: compare character n-gram overlap between documents.

Language identification: character bigram and trigram distributions act as language-specific fingerprints.

Named entity detection preprocessing: bigrams like “New York” and “Los Angeles” are more likely to be entities than either word alone.
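
As one concrete illustration of these applications, autocomplete can be sketched with nothing more than bigram counts. The corpus and the `suggest` helper below are made up for the example; a real system would train on far more text and usually on longer contexts.

```python
from collections import Counter, defaultdict

# Illustrative training lines
corpus = [
    "language models generate text",
    "language models learn patterns",
    "language models generate code",
    "neural networks learn representations",
]

# For each word, count which words follow it
next_words = defaultdict(Counter)
for line in corpus:
    tokens = line.split()
    for prev, nxt in zip(tokens, tokens[1:]):
        next_words[prev][nxt] += 1

def suggest(prev_word, k=2):
    """Return up to k most frequent continuations of prev_word."""
    followers = next_words.get(prev_word, Counter())
    return [word for word, _ in followers.most_common(k)]

print(suggest("models"))    # ['generate', 'learn']
print(suggest("language"))  # ['models']
```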

N-gram Perplexity

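Perplexity measures how well a language model predicts a test set: it is the inverse of the geometric-mean probability the model assigns to each test token. Lower perplexity means the model fits the test text better: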

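from nltk.lm import Laplace
from nltk.lm.preprocessing import padded_everygram_pipeline
from nltk.tokenize import word_tokenize
from nltk.util import ngrams

# Reuses `sentences` (the tokenized training sentences) from the language model example above

# Laplace (add-one) smoothing handles unseen n-grams (avoids zero probability)
model_laplace = Laplace(2)
train_data, vocab = padded_everygram_pipeline(2, sentences)
model_laplace.fit(train_data, vocab)

# Out-of-vocabulary test words such as "process" are mapped to <UNK> when scored
test_sentences = [word_tokenize("language models process text")]
test_data = [list(ngrams(s, 2)) for s in test_sentences]
perplexity = model_laplace.perplexity(test_data[0])
print(f"Perplexity: {perplexity:.2f}")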
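
To show what the perplexity number is made of, the sketch below computes it by hand for an add-one smoothed bigram model on a tiny made-up corpus. It is a simplified illustration, not a reimplementation of NLTK: the vocabulary is just the set of training tokens plus the padding symbols, there is no `<UNK>` handling, and all names are hypothetical.

```python
import math
from collections import Counter

# Tiny illustrative training corpus, already tokenized
train = [
    ["language", "models", "learn", "patterns"],
    ["language", "models", "generate", "text"],
]

unigram_counts = Counter()
bigram_counts = Counter()
for sent in train:
    padded = ["<s>"] + sent + ["</s>"]
    unigram_counts.update(padded)
    bigram_counts.update(zip(padded, padded[1:]))

vocab_size = len(unigram_counts)

def laplace_prob(word, prev):
    """Add-one smoothed P(word | prev)."""
    return (bigram_counts[(prev, word)] + 1) / (unigram_counts[prev] + vocab_size)

def perplexity(sentence):
    """2 to the power of the average negative log2 probability per bigram."""
    padded = ["<s>"] + sentence + ["</s>"]
    pairs = list(zip(padded, padded[1:]))
    log_sum = sum(math.log2(laplace_prob(w, p)) for p, w in pairs)
    return 2 ** (-log_sum / len(pairs))

seen = perplexity(["language", "models", "learn", "patterns"])
shuffled = perplexity(["patterns", "text", "models", "language"])
print(f"Seen word order: {seen:.2f}")
print(f"Shuffled order:  {shuffled:.2f}")
```

The sentence in a familiar word order gets the lower perplexity, and smoothing keeps the shuffled sentence finite instead of infinite.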

N-gram language models have been largely superseded by neural LMs (GPT, BERT), but they remain useful for lightweight, interpretable applications where a large model isn’t justified.