N-grams in NLP
An n-gram is a contiguous sequence of n items from a piece of text. N-grams capture local context that a single-word (unigram) model misses — “New York” is more meaningful as a unit than “New” and “York” separately.
Types of N-grams
| Name | n | Example from “the quick brown fox” |
|---|---|---|
| Unigram | 1 | “the”, “quick”, “brown”, “fox” |
| Bigram | 2 | “the quick”, “quick brown”, “brown fox” |
| Trigram | 3 | “the quick brown”, “quick brown fox” |
| 4-gram | 4 | “the quick brown fox” |
Character n-grams split on characters rather than words:
- “quick” → char bigrams: “qu”, “ui”, “ic”, “ck”
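The sliding window behind the table and the character example above can be sketched in a few lines of plain Python. The helper name `ngrams_of` is made up for this illustration and is not part of any library.

```python
def ngrams_of(items, n):
    """Return every contiguous run of n items as a tuple."""
    return [tuple(items[i:i + n]) for i in range(len(items) - n + 1)]

words = "the quick brown fox".split()
word_bigrams = ngrams_of(words, 2)
word_trigrams = ngrams_of(words, 3)

# A string is a sequence of characters, so the same helper yields character n-grams.
char_bigrams = ["".join(g) for g in ngrams_of("quick", 2)]

print(word_bigrams)
print(word_trigrams)
print(char_bigrams)
```

A sequence of length L yields L - n + 1 n-grams, so a four-word phrase has three bigrams, two trigrams, and a single 4-gram.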
Generating N-grams with NLTK

    from nltk.util import ngrams
    from nltk.tokenize import word_tokenize
    import nltk

    nltk.download('punkt_tab')

    text = "Large language models transform natural language processing in remarkable ways."
    tokens = word_tokenize(text.lower())

    # Bigrams
    bigrams = list(ngrams(tokens, 2))
    print("Bigrams:", bigrams[:5])
    # [('large', 'language'), ('language', 'models'), ('models', 'transform'),
    #  ('transform', 'natural'), ('natural', 'language')]

    # Trigrams
    trigrams = list(ngrams(tokens, 3))
    print("Trigrams:", trigrams[:4])
    # [('large', 'language', 'models'), ('language', 'models', 'transform'), ...]

    # Frequency distribution
    from nltk import FreqDist
    bigram_freq = FreqDist(bigrams)
    print("Most common bigrams:", bigram_freq.most_common(5))

N-grams with scikit-learn

    from sklearn.feature_extraction.text import CountVectorizer

    corpus = [
        "machine learning models process text data efficiently",
        "deep learning transforms natural language processing tasks",
        "language models generate fluent and coherent text",
        "neural networks learn representations from large datasets"
    ]

    # Bigrams only
    bigram_vectorizer = CountVectorizer(ngram_range=(2, 2), stop_words='english')
    X_bigrams = bigram_vectorizer.fit_transform(corpus)

    print("Bigram features:")
    print(bigram_vectorizer.get_feature_names_out())

    # Mixed: unigrams + bigrams + trigrams
    mixed_vectorizer = CountVectorizer(ngram_range=(1, 3), stop_words='english', max_features=50)
    X_mixed = mixed_vectorizer.fit_transform(corpus)
    print(f"\nMixed n-gram features: {X_mixed.shape[1]}")

Character N-grams
Character n-grams are especially powerful for:
- Misspelling and typo tolerance
- Subword morphology (detecting that “unhappy”, “happiness”, and “happily” share “happ”)
- Language identification
- Handling out-of-vocabulary words
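The shared-stem idea in the list above can be made concrete with a small plain-Python sketch. The helper `char_ngrams` and the choice of 4-grams are assumptions made for this illustration.

```python
def char_ngrams(word, n):
    """Return the set of character n-grams in a single word."""
    return {word[i:i + n] for i in range(len(word) - n + 1)}

words = ["unhappy", "happiness", "happily"]

# Intersect the 4-gram sets to find substrings common to all three words.
shared = set.intersection(*(char_ngrams(w, 4) for w in words))
print(shared)
```

A word-level vocabulary treats these three words as unrelated tokens, while the character n-gram view exposes the substring they have in common.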

    from sklearn.feature_extraction.text import CountVectorizer

    char_vectorizer = CountVectorizer(
        analyzer='char_wb',  # char n-grams, respecting word boundaries
        ngram_range=(3, 5),
        max_features=200
    )

    texts = ["tokenization", "tokenizing", "tokenizer", "untokenized"]
    X = char_vectorizer.fit_transform(texts)

    features = char_vectorizer.get_feature_names_out()
    print("Character n-gram features:", features[:15])

N-gram Language Model
A language model assigns probabilities to sequences of words. An n-gram language model estimates the probability of the next word from only the previous n-1 words (a Markov assumption), typically using counts from a training corpus.
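As a rough sketch of what such an estimate looks like, the snippet below computes maximum-likelihood trigram probabilities by counting. The toy corpus and the names `followers` and `mle_prob` are invented for illustration, and sentence padding is left out to keep the example short.

```python
from collections import Counter, defaultdict

# Toy corpus, already tokenized (an illustrative assumption, not data from the article).
corpus = [
    ["language", "models", "learn", "patterns"],
    ["language", "models", "generate", "text"],
    ["language", "models", "learn", "fast"],
]

n = 3
followers = defaultdict(Counter)  # context tuple -> counts of the word that follows it
for sentence in corpus:
    for i in range(len(sentence) - n + 1):
        context = tuple(sentence[i:i + n - 1])
        followers[context][sentence[i + n - 1]] += 1

def mle_prob(word, context):
    """P(word | context) = count(context followed by word) / count(context followed by anything)."""
    counts = followers[tuple(context)]
    total = sum(counts.values())
    return counts[word] / total if total else 0.0

print(mle_prob("learn", ["language", "models"]))     # 2 of the 3 observed continuations
print(mle_prob("generate", ["language", "models"]))  # 1 of the 3 observed continuations
```

Any word never seen after a given context gets probability zero under this estimate, which is the weakness that smoothing methods address.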

    from nltk.lm import MLE, Laplace
    from nltk.lm.preprocessing import padded_everygram_pipeline
    from nltk.tokenize import word_tokenize, sent_tokenize
    import nltk

    text = """Natural language processing is a field of AI. Language models learn patterns
    from text data. Modern NLP systems use deep learning. Transformers have
    revolutionized language understanding and generation."""

    sentences = [word_tokenize(s.lower()) for s in sent_tokenize(text)]
    n = 3  # trigram model

    train_data, vocab = padded_everygram_pipeline(n, sentences)

    # Train an MLE trigram model
    model = MLE(n)
    model.fit(train_data, vocab)

    # Score a word given its two-word context. "processing" follows "natural language"
    # in the training text, so it gets a non-zero probability; an unseen continuation
    # such as "language" would score 0 under MLE.
    print(model.score("processing", ["natural", "language"]))  # P("processing" | "natural language")

    # Generate text
    generated = list(model.generate(10, random_seed=42))
    print("Generated:", ' '.join(generated))

Practical Applications
Sentiment analysis features — bigrams like “not good” or “very fast” carry more sentiment than individual words.
Spam detection — character 4-grams catch obfuscation tricks like “V1agra” and “Vi@gra”.
Autocomplete — suggest the next word based on the most probable n-gram completion.
Plagiarism detection — compare character n-gram overlap between documents.
Language identification — character bigram/trigram distributions are language-specific fingerprints.
Named entity detection preprocessing — bigrams like “New York”, “Los Angeles” are more likely entities than either word alone.
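The sentiment point above comes down to word order, which unigram counts discard. A minimal sketch with two invented three-word reviews (the helper `word_ngrams` is also hypothetical):

```python
from collections import Counter

def word_ngrams(tokens, n):
    """Return the list of word n-grams (as tuples) in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

review_a = "good not bad".split()
review_b = "bad not good".split()

# Both reviews contain exactly the same words, so unigram counts cannot tell them apart.
same_unigrams = Counter(review_a) == Counter(review_b)

# Their bigrams differ, which lets a model learn "not bad" and "not good" separately.
bigrams_a = set(word_ngrams(review_a, 2))
bigrams_b = set(word_ngrams(review_b, 2))

print(same_unigrams)
print(bigrams_a & bigrams_b)
```

The two reviews are indistinguishable as bags of words but share no bigram, so a classifier with bigram features can assign them opposite polarity.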
N-gram Perplexity
Perplexity measures how well a language model predicts a test set: it is the inverse probability of the test set, normalized by the number of tokens. Lower perplexity means a better model.
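To make the definition concrete, here is a small sketch that computes perplexity directly from per-token probabilities. The function and the probability values are invented for illustration and do not come from a trained model.

```python
import math

def perplexity(probs):
    """Perplexity of a sequence, given the model's probability for each token."""
    avg_neg_log2 = -sum(math.log2(p) for p in probs) / len(probs)
    return 2 ** avg_neg_log2

# A model that spreads probability evenly over 4 choices at every step.
uniform = perplexity([0.25, 0.25, 0.25, 0.25])

# A model that is usually confident about the token that actually occurs.
confident = perplexity([0.9, 0.8, 0.9, 0.7])

print(uniform, confident)
```

A perplexity of 4 means the model is as uncertain as if it were choosing uniformly among four words at each step. A single zero probability makes the logarithm undefined (an infinite perplexity), which is why smoothed models are used for evaluation.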

    from nltk.lm import Laplace
    from nltk.lm.preprocessing import padded_everygram_pipeline
    from nltk.tokenize import word_tokenize
    from nltk.util import ngrams

    # Laplace smoothing handles unseen n-grams (avoids zero probability).
    # `sentences` is the tokenized corpus from the language model example above.
    model_laplace = Laplace(2)
    train_data, vocab = padded_everygram_pipeline(2, sentences)
    model_laplace.fit(train_data, vocab)

    test_sentences = [word_tokenize("language models process text")]
    test_data = [list(ngrams(s, 2)) for s in test_sentences]
    perplexity = model_laplace.perplexity(test_data[0])
    print(f"Perplexity: {perplexity:.2f}")

N-gram language models have been largely superseded by neural LMs (GPT, BERT), but they remain useful for lightweight, interpretable applications where a large model isn’t justified.