
💬 Natural Language Processing 40 guides · updated 2026

From tokenisation and embeddings to transformer-based language understanding — the NLP fundamentals that underpin every modern LLM.

FastText

FastText is Meta AI’s text representation library that extends Word2Vec with subword embeddings — it represents words as bags of character n-grams. This means FastText can generate vectors for words it has never seen, handle misspellings, and work well with morphologically rich languages.


Installation

pip install fasttext

The Key Difference: Subword Embeddings

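In Word2Vec, “running” has a single vector. In FastText, the word is first wrapped in the boundary markers < and >, then represented as the sum of its character n-gram vectors plus a vector for the whole word. With n = 3, the n-grams are: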

"running" → ["<ru", "run", "unn", "nni", "nin", "ing", "ng>"]

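This means that FastText can build a vector for a word it never saw in training from the n-grams that word shares with known words, that a misspelling lands close to the correct spelling because most of their n-grams overlap, and that inflected forms in morphologically rich languages share information through their common stems and affixes.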
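The decomposition can be sketched in a few lines of plain Python. This is an illustration only, not fastText's internal code: the helper name `char_ngrams` is hypothetical, and its defaults of 3 and 6 are chosen to mirror the library's default `minn` and `maxn` for unsupervised training.

```python
def char_ngrams(word, minn=3, maxn=6):
    """Return the character n-grams of `word`, wrapped in < and > boundary markers."""
    wrapped = f"<{word}>"
    ngrams = []
    for n in range(minn, maxn + 1):
        for start in range(len(wrapped) - n + 1):
            ngrams.append(wrapped[start:start + n])
    return ngrams

# Trigrams only, matching the "running" example
trigrams = char_ngrams("running", minn=3, maxn=3)
print(trigrams)
# ['<ru', 'run', 'unn', 'nni', 'nin', 'ing', 'ng>']

# A misspelling shares most of its trigrams with the correct form
correct = set(trigrams)
misspelt = set(char_ngrams("runing", minn=3, maxn=3))
shared = correct & misspelt
print(f"{len(shared)} of {len(misspelt)} trigrams shared: {sorted(shared)}")
```

Five of the six trigrams of "runing" also occur in "running", which is why the two words end up with similar vectors.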


Training Word Embeddings with FastText

import fasttext
import tempfile
# Prepare training data (one document per line)
corpus = """
natural language processing transforms how computers understand text
machine learning models learn patterns from large text corpora
deep learning architectures like transformers process sequential data
bert and gpt are built on transformer self-attention mechanisms
word embeddings capture semantic relationships in continuous vector space
fasttext uses character ngrams for subword representations
"""
with tempfile.NamedTemporaryFile(mode='w', suffix='.txt', delete=False) as f:
    f.write(corpus)
    corpus_file = f.name
# Train Skip-Gram model
model = fasttext.train_unsupervised(
    corpus_file,
    model='skipgram',  # 'skipgram' or 'cbow'
    dim=100,           # embedding dimension
    ws=5,              # context window size
    minCount=1,        # minimum word count (1 keeps every word in this tiny corpus)
    minn=2,            # min char n-gram size
    maxn=6,            # max char n-gram size
    epoch=10
)
# Word vector
vector = model.get_word_vector("transformer")
print(f"Vector dim: {len(vector)}")  # 100
# Subword handling: works for OOV words
oov_vector = model.get_word_vector("transformerssss")
print(f"OOV vector (non-zero): {any(oov_vector != 0)}")  # True
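To see why an out-of-vocabulary word still gets a non-zero vector, here is a toy sketch in plain Python, not fastText's actual implementation. It assumes each n-gram is hashed into a fixed table of buckets (fastText uses a similar hashing trick with a much larger table) and that a word vector is the average of its n-gram vectors. The pseudo-random `bucket_vector` stands in for trained embeddings, and all function names are hypothetical.

```python
import math
import random

NUM_BUCKETS = 100_000  # toy value; an assumption for this sketch
DIM = 64               # toy embedding dimension

def char_ngrams(word, minn=3, maxn=6):
    """Character n-grams of the word wrapped in < and > boundary markers."""
    wrapped = f"<{word}>"
    return [wrapped[i:i + n]
            for n in range(minn, maxn + 1)
            for i in range(len(wrapped) - n + 1)]

def fnv1a(text):
    """32-bit FNV-1a hash of the UTF-8 bytes of `text`."""
    h = 2166136261
    for byte in text.encode("utf-8"):
        h ^= byte
        h = (h * 16777619) & 0xFFFFFFFF
    return h

def bucket_vector(bucket):
    """Stand-in for a trained n-gram embedding: a fixed pseudo-random vector per bucket."""
    rng = random.Random(bucket)
    return [rng.gauss(0.0, 1.0) for _ in range(DIM)]

def word_vector(word):
    """Average the bucket vectors of all character n-grams of `word`."""
    grams = char_ngrams(word)
    total = [0.0] * DIM
    for gram in grams:
        vec = bucket_vector(fnv1a(gram) % NUM_BUCKETS)
        total = [t + v for t, v in zip(total, vec)]
    return [t / len(grams) for t in total]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

known = word_vector("transformer")
misspelt = word_vector("transformerssss")  # never "seen", still gets a vector
unrelated = word_vector("pasta")
print(f"transformer vs transformerssss: {cosine(known, misspelt):.2f}")
print(f"transformer vs pasta:           {cosine(known, unrelated):.2f}")
```

Because "transformerssss" shares most of its n-grams with "transformer", the two toy vectors point in a similar direction, while "pasta" shares none and stays close to orthogonal. In the real library, an in-vocabulary word additionally has its own whole-word vector.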

Text Classification with FastText

FastText is known for its classification speed: its authors report training on more than a billion words in under ten minutes on a standard multicore CPU. A small three-class example:

import fasttext
import tempfile
# FastText format: __label__<label> text
train_data = """__label__tech The GPU ran out of memory during model training
__label__tech The API returned a 500 error after 30 seconds
__label__tech CUDA drivers need to be updated for PyTorch 2.0
__label__food The pasta needs more olive oil and fresh basil
__label__food Sourdough bread requires a 12-hour fermentation period
__label__food Roasting vegetables at 425°F brings out their natural sweetness
__label__finance The Federal Reserve raised rates by 25 basis points
__label__finance Q3 earnings exceeded analyst expectations by 12%
__label__finance The bond yield inverted ahead of the recession
"""
with tempfile.NamedTemporaryFile(mode='w', suffix='.txt', delete=False) as f:
    f.write(train_data)
    train_file = f.name
model = fasttext.train_supervised(
    input=train_file,
    epoch=50,
    lr=0.5,
    wordNgrams=2,  # bigram features
    dim=100,
    loss='softmax'
)
# Predict: returns a tuple of labels and an array of probabilities
tests = [
    "The model accuracy dropped after fine-tuning on new data",
    "Homemade pizza dough should rest for at least one hour",
    "Interest rates remain elevated despite recent inflation data"
]
for text in tests:
    label, confidence = model.predict(text)
    print(f"[{label[0].replace('__label__', ''):<10} {confidence[0]:.3f}] {text}")
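Real training data rarely arrives in the `__label__<label> text` format already. The helper below is a minimal sketch of producing it from Python data. The function name `to_fasttext_line` and the extra `baking` label are hypothetical, and the sketch assumes labels contain no whitespace.

```python
def to_fasttext_line(labels, text):
    """Format one example as '__label__a __label__b text' on a single line."""
    if isinstance(labels, str):
        labels = [labels]
    prefix = " ".join(f"__label__{label}" for label in labels)
    # fastText reads one example per line, so collapse newlines and repeated spaces
    clean_text = " ".join(text.split())
    return f"{prefix} {clean_text}"

examples = [
    ("tech", "The GPU ran out of memory\nduring model training"),
    (["food", "baking"], "Sourdough bread requires a 12-hour fermentation period"),
]
lines = [to_fasttext_line(labels, text) for labels, text in examples]
print("\n".join(lines))
```

Collapsing whitespace matters because an embedded newline would otherwise split one example into two lines, the second of which has no label.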

Evaluating the Classifier

# Reuses `model` and `tempfile` from the classification example
# Save test data
test_data = """__label__tech Backpropagation updates neural network weights iteratively
__label__food Season the chicken with salt pepper and lemon
__label__finance The stock market declined following the Fed announcement
"""
with tempfile.NamedTemporaryFile(mode='w', suffix='.txt', delete=False) as f:
    f.write(test_data)
    test_file = f.name
# model.test returns (number of samples, precision@1, recall@1)
result = model.test(test_file)
print(f"Samples: {result[0]}")
print(f"Precision@1: {result[1]:.4f}")
print(f"Recall@1: {result[2]:.4f}")
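To make the two reported numbers concrete, here is a plain-Python sketch of precision@k and recall@k. It assumes the usual definitions (correct predictions divided by the number of predictions made, and by the number of true labels, pooled over all examples). The function name and the toy predictions are hypothetical, not output from a trained model.

```python
def precision_recall_at_k(gold, predicted, k=1):
    """gold: list of sets of true labels; predicted: list of ranked label lists."""
    correct = n_predicted = n_gold = 0
    for true_labels, ranked in zip(gold, predicted):
        top_k = ranked[:k]
        correct += sum(1 for label in top_k if label in true_labels)
        n_predicted += len(top_k)
        n_gold += len(true_labels)
    return correct / n_predicted, correct / n_gold

gold = [{"tech"}, {"food"}, {"finance"}]
predicted = [["tech", "finance"], ["food", "tech"], ["tech", "finance"]]

p1, r1 = precision_recall_at_k(gold, predicted, k=1)
print(f"P@1={p1:.4f} R@1={r1:.4f}")  # P@1=0.6667 R@1=0.6667

p2, r2 = precision_recall_at_k(gold, predicted, k=2)
print(f"P@2={p2:.4f} R@2={r2:.4f}")  # P@2=0.5000 R@2=1.0000
```

With exactly one true label per example and k = 1, precision and recall coincide with plain accuracy. They diverge once k grows or examples carry several labels.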

Loading Pre-Trained FastText Vectors

Meta provides pre-trained FastText vectors for 157 languages:

import fasttext
import fasttext.util
# Download the English vectors once (a multi-gigabyte download; the unpacked .bin is about 7 GB).
# Uncomment on first run:
# fasttext.util.download_model('en', if_exists='ignore')
# Load the model (expects cc.en.300.bin in the working directory)
ft = fasttext.load_model('cc.en.300.bin')
# Get vectors
print(ft.get_word_vector("natural").shape)  # (300,)
print(ft.get_word_vector("preprocessing").shape)  # (300,), even if OOV
# Nearest neighbors: a list of (score, word) pairs
neighbors = ft.get_nearest_neighbors("embedding", k=5)
for score, word in neighbors:
    print(f"{word}: {score:.4f}")
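Nearest-neighbour lookup ranks the vocabulary by cosine similarity to the query vector. The sketch below shows the idea on a hand-made, three-dimensional toy vocabulary. The vectors and the function names are invented for illustration and do not come from the pre-trained model.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def nearest_neighbors(query, vectors, k=2):
    """Return the k most similar words as (score, word) pairs, best first."""
    scored = [(cosine(vectors[query], vec), word)
              for word, vec in vectors.items() if word != query]
    return sorted(scored, reverse=True)[:k]

# Hypothetical toy embeddings
toy_vectors = {
    "embedding": [1.0, 1.0, 0.0],
    "vector":    [1.0, 0.8, 0.1],
    "encoding":  [0.9, 1.0, 0.3],
    "pasta":     [0.0, 0.1, 1.0],
}

result = nearest_neighbors("embedding", toy_vectors, k=2)
for score, word in result:
    print(f"{word}: {score:.4f}")
```

A brute-force scan like this is fine for a toy vocabulary. For millions of words an approximate nearest-neighbour index is the usual choice.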

Language Identification

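FastText also provides pre-trained language identification models covering 176 languages. They are not bundled with the pip package and must be downloaded separately: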

import fasttext
# Download the model first, for example from
# https://dl.fbaipublicfiles.com/fasttext/supervised-models/lid.176.ftz
model = fasttext.load_model("lid.176.ftz")  # Compact 917KB model
texts = [
    "Natural language processing is transforming how AI understands text.",
    "Le traitement du langage naturel transforme l'intelligence artificielle.",
    "Verarbeitung natürlicher Sprache revolutioniert die künstliche Intelligenz.",
    "自然言語処理は人工知能を変革しています。",
]
for text in texts:
    # predict returns (labels, probabilities); the input must not contain newlines
    predictions = model.predict(text)
    lang = predictions[0][0].replace('__label__', '')
    conf = predictions[1][0]
    print(f"[{lang} {conf:.3f}] {text[:50]}")
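In practice it often helps to request several candidate languages and discard low-confidence guesses, for instance on very short or mixed-language input. The helper below is a sketch that post-processes a `(labels, probabilities)` pair of the shape `predict` returns. The name `parse_predictions`, the 0.5 cut-off, and the sample numbers are all assumptions made for illustration.

```python
def parse_predictions(labels, probs, min_confidence=0.5):
    """Turn fastText-style (labels, probabilities) output into (language, confidence) pairs."""
    results = []
    for label, prob in zip(labels, probs):
        if prob >= min_confidence:
            results.append((label.replace("__label__", ""), float(prob)))
    return results

# Stand-in for the return value of model.predict(text, k=3); the numbers are made up
labels = ("__label__fr", "__label__en", "__label__it")
probs = [0.91, 0.05, 0.02]

print(parse_predictions(labels, probs))                       # [('fr', 0.91)]
print(parse_predictions(labels, probs, min_confidence=0.95))  # []
```

An empty result signals that no language cleared the threshold, which a caller can treat as "unknown" rather than trusting a weak top guess.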

FastText vs Word2Vec vs BERT

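| Feature | FastText | Word2Vec | BERT |
| --- | --- | --- | --- |
| Subword handling | Yes (character n-grams) | No | Yes (WordPiece) |
| OOV words | Yes | No | Yes |
| Training speed | Very fast | Fast | Slow |
| Classification | Built-in | No | Via fine-tuning |
| Context-aware | No | No | Yes |
| Accuracy | Good | Good | Best |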

FastText is the best choice when you need fast text classification at scale, support for misspellings and morphologically rich languages, or a lightweight model that doesn’t need a GPU.