💬 Natural Language Processing 40 guides · updated 2026

From tokenisation and embeddings to transformer-based language understanding — the NLP fundamentals that underpin every modern LLM.

N-grams in NLP

An n-gram is a contiguous sequence of n items from a piece of text. N-grams capture local context that a single-word (unigram) model misses — “New York” is more meaningful as a unit than “New” and “York” separately.


Types of N-grams

Name | n | Example from “the quick brown fox”
Unigram | 1 | “the”, “quick”, “brown”, “fox”
Bigram | 2 | “the quick”, “quick brown”, “brown fox”
Trigram | 3 | “the quick brown”, “quick brown fox”
4-gram | 4 | “the quick brown fox”
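
The rows of this table can be reproduced with a sliding window over the token list. The following is a minimal sketch in plain Python; the helper name `word_ngrams` is made up for illustration and is not part of any library.

```python
def word_ngrams(tokens, n):
    """Return all contiguous n-token windows as tuples (hypothetical helper)."""
    # A window starting at index i covers tokens[i : i + n];
    # there are len(tokens) - n + 1 such windows (none if n > len(tokens)).
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = "the quick brown fox".split()

# Print the unigrams, bigrams, trigrams and the single 4-gram in turn.
for n in range(1, 5):
    print(n, [" ".join(gram) for gram in word_ngrams(tokens, n)])
```

A sentence of k tokens yields k - n + 1 n-grams, which is why the table rows get shorter as n grows.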

Character n-grams split on characters rather than words: the character trigrams of “quick” are “qui”, “uic”, and “ick”.
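
The same sliding-window idea works on a string directly. This is a small sketch, assuming no normalization beyond what the caller does; `char_ngrams` is a hypothetical helper name, not a library function.

```python
def char_ngrams(text, n):
    """Return all contiguous n-character substrings of text (hypothetical helper)."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


print(char_ngrams("quick", 3))    # ['qui', 'uic', 'ick']
# Spaces are characters too, so n-grams can span word boundaries.
print(char_ngrams("the fox", 3))  # ['the', 'he ', 'e f', ' fo', 'fox']
```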


Generating N-grams with NLTK

import nltk
from nltk import FreqDist
from nltk.tokenize import word_tokenize
from nltk.util import ngrams

# Tokenizer data needed by word_tokenize (older NLTK releases used 'punkt' instead)
nltk.download('punkt_tab')

text = "Large language models transform natural language processing in remarkable ways."
tokens = word_tokenize(text.lower())

# Bigrams
bigrams = list(ngrams(tokens, 2))
print("Bigrams:", bigrams[:5])
# [('large', 'language'), ('language', 'models'), ('models', 'transform'),
# ('transform', 'natural'), ('natural', 'language')]

# Trigrams
trigrams = list(ngrams(tokens, 3))
print("Trigrams:", trigrams[:4])
# [('large', 'language', 'models'), ('language', 'models', 'transform'), ...]

# Frequency distribution
# Every bigram in this one-sentence text occurs exactly once, so the counts
# only become informative on a larger corpus.
bigram_freq = FreqDist(bigrams)
print("Most common bigrams:", bigram_freq.most_common(5))

N-grams with scikit-learn

from sklearn.feature_extraction.text import CountVectorizer
corpus = [
"machine learning models process text data efficiently",
"deep learning transforms natural language processing tasks",
"language models generate fluent and coherent text",
"neural networks learn representations from large datasets"
]
# Bigrams only
bigram_vectorizer = CountVectorizer(ngram_range=(2, 2), stop_words='english')
X_bigrams = bigram_vectorizer.fit_transform(corpus)
print("Bigram features:")
print(bigram_vectorizer.get_feature_names_out())
# Mixed: unigrams + bigrams + trigrams
mixed_vectorizer = CountVectorizer(ngram_range=(1, 3), stop_words='english', max_features=50)
X_mixed = mixed_vectorizer.fit_transform(corpus)
print(f"\nMixed n-gram features: {X_mixed.shape[1]}")

Character N-grams

Character n-grams are especially useful when words share subword structure, as with the morphological variants in this example:

from sklearn.feature_extraction.text import CountVectorizer
char_vectorizer = CountVectorizer(
analyzer='char_wb', # char n-grams, respecting word boundaries
ngram_range=(3, 5),
max_features=200
)
texts = ["tokenization", "tokenizing", "tokenizer", "untokenized"]
X = char_vectorizer.fit_transform(texts)
features = char_vectorizer.get_feature_names_out()
print("Character n-gram features:", features[:15])

N-gram Language Model

A language model assigns probabilities to sequences of words. An n-gram language model estimates the probability of the next word based on the previous n-1 words:

from nltk.lm import MLE
from nltk.lm.preprocessing import padded_everygram_pipeline
from nltk.tokenize import word_tokenize, sent_tokenize

text = """
Natural language processing is a field of AI. Language models learn patterns
from text data. Modern NLP systems use deep learning. Transformers have
revolutionized language understanding and generation.
"""

# Split into sentences, then into lowercase word tokens
sentences = [word_tokenize(s.lower()) for s in sent_tokenize(text)]

n = 3  # trigram model
# Pads each sentence and yields its 1- to n-grams, plus a flat stream of words for the vocabulary
train_data, vocab = padded_everygram_pipeline(n, sentences)

# Train an MLE trigram model
model = MLE(n)
model.fit(train_data, vocab)

# Score a word given its context: P("processing" | "natural language")
print(model.score("processing", ["natural", "language"]))
# Under MLE, a continuation never seen in training scores 0: P("language" | "natural language")
print(model.score("language", ["natural", "language"]))

# Generate 10 tokens; the output may include padding symbols such as </s>
generated = list(model.generate(10, random_seed=42))
print("Generated:", ' '.join(generated))
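
To make the estimate concrete: under maximum likelihood, the probability of a word given its context is the count of the full n-gram divided by the count of the context. The sketch below does this by hand for bigrams on a toy corpus. The corpus and the helper names `bigram_prob` and `most_likely_next` are invented for illustration, and NLTK's `MLE` class may differ in details such as padding and vocabulary handling.

```python
from collections import Counter, defaultdict

# Toy corpus, invented for this example.
toy_sentences = [
    ["language", "models", "learn", "patterns"],
    ["language", "models", "generate", "text"],
    ["language", "understanding", "is", "hard"],
]

# next_counts[w1][w2] = number of times w2 directly follows w1.
next_counts = defaultdict(Counter)
for sent in toy_sentences:
    for w1, w2 in zip(sent, sent[1:]):
        next_counts[w1][w2] += 1


def bigram_prob(word, prev):
    """MLE estimate P(word | prev) = count(prev, word) / count(prev, any follower)."""
    total = sum(next_counts[prev].values())
    return next_counts[prev][word] / total if total else 0.0


def most_likely_next(prev):
    """Return the most frequent follower of prev, or None if prev has no followers."""
    followers = next_counts.get(prev)
    return followers.most_common(1)[0][0] if followers else None


print(bigram_prob("models", "language"))         # 2 of the 3 followers of "language"
print(bigram_prob("understanding", "language"))  # 1 of the 3
print(most_likely_next("language"))              # models
```

The `most_likely_next` lookup is also the core of a simple autocomplete: suggest whichever word most often followed the current context in the training data.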

Practical Applications

Sentiment analysis features — bigrams like “not good” or “very fast” carry more sentiment than individual words.

Spam detection — character 4-grams catch obfuscation tricks like “V1agra” and “Vi@gra”.

Autocomplete — suggest the next word based on the most probable n-gram completion.

Plagiarism detection — compare character n-gram overlap between documents.

Language identification — character bigram/trigram distributions are language-specific fingerprints.

Named entity detection preprocessing — bigrams like “New York”, “Los Angeles” are more likely entities than either word alone.
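
As one illustration of the plagiarism-detection idea, document similarity can be approximated by the Jaccard overlap of character n-gram sets. This is a rough sketch only: the function names, the choice of n = 4, and the sample strings are assumptions made for the example, and real systems typically add more normalization, hashing, and thresholds tuned on data.

```python
def char_ngram_set(text, n=4):
    """Set of character n-grams after lowercasing and collapsing whitespace."""
    normalized = " ".join(text.lower().split())
    return {normalized[i:i + n] for i in range(len(normalized) - n + 1)}


def jaccard_similarity(a, b, n=4):
    """Size of the intersection over size of the union of the two n-gram sets."""
    grams_a, grams_b = char_ngram_set(a, n), char_ngram_set(b, n)
    union = grams_a | grams_b
    return len(grams_a & grams_b) / len(union) if union else 0.0


original = "N-grams capture local context in text."
reworded = "N-grams capture the local context of text."
unrelated = "The weather was pleasant yesterday."

print(jaccard_similarity(original, reworded))   # relatively high
print(jaccard_similarity(original, unrelated))  # close to zero
```

Because the comparison works on characters, small edits such as inserted words or changed endings lower the score only slightly instead of breaking the match.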


N-gram Perplexity

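Perplexity measures how well a language model predicts a test set: it is the inverse probability of the test set, normalized by the number of tokens. Lower perplexity means the model assigns higher probability to the test data: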

from nltk.lm import Laplace
from nltk.lm.preprocessing import padded_everygram_pipeline
from nltk.tokenize import word_tokenize
from nltk.util import ngrams

# Reuses `sentences` (the tokenized training sentences) from the language model example
# Laplace smoothing handles unseen n-grams (avoids zero probability)
model_laplace = Laplace(2)
train_data, vocab = padded_everygram_pipeline(2, sentences)
model_laplace.fit(train_data, vocab)

test_sentences = [word_tokenize("language models process text")]
# Bigrams of each test sentence (unpadded, so sentence boundaries are not scored)
test_data = [list(ngrams(s, 2)) for s in test_sentences]
perplexity = model_laplace.perplexity(test_data[0])
print(f"Perplexity: {perplexity:.2f}")
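
The arithmetic behind smoothing and perplexity is short enough to write out. The sketch below computes add-one (Laplace) bigram probabilities and the resulting perplexity on a toy corpus. The data and helper names are hypothetical, padding is ignored for brevity, and the numbers will not match NLTK's `Laplace` model, which also handles sentence padding and unknown words.

```python
import math
from collections import Counter

# Toy training data, invented for this example.
train = [
    ["language", "models", "learn", "patterns"],
    ["language", "models", "generate", "text"],
]

unigram_counts = Counter(w for sent in train for w in sent)
bigram_counts = Counter(pair for sent in train for pair in zip(sent, sent[1:]))
vocab_size = len(unigram_counts)


def laplace_prob(word, prev):
    """Add-one estimate: (count(prev, word) + 1) / (count(prev) + V)."""
    return (bigram_counts[(prev, word)] + 1) / (unigram_counts[prev] + vocab_size)


def perplexity(tokens):
    """exp of the average negative log-probability of the bigrams (needs >= 2 tokens)."""
    pairs = list(zip(tokens, tokens[1:]))
    log_prob = sum(math.log(laplace_prob(w2, w1)) for w1, w2 in pairs)
    return math.exp(-log_prob / len(pairs))


seen = ["language", "models", "learn", "patterns"]
shuffled = ["patterns", "learn", "models", "language"]

print(perplexity(seen))      # lower: these bigrams occurred in training
print(perplexity(shuffled))  # higher: none of these bigrams were seen
```

Without the added 1 in the numerator, any unseen bigram would have probability 0, its logarithm would be undefined, and the perplexity of the shuffled sentence would be infinite.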

N-gram language models have been largely superseded by neural LMs (GPT, BERT), but they remain useful for lightweight, interpretable applications where a large model isn’t justified.