FastText

FastText is Meta AI’s library for word representations and text classification. It extends Word2Vec with subword embeddings: each word is represented as a bag of character n-grams. As a result, FastText can generate vectors for words it has never seen, cope with misspellings, and work well with morphologically rich languages.

Installation

pip install fasttext

The Key Difference: Subword Embeddings

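In Word2Vec, “running” has a single vector. In FastText, “running” is represented as the sum of its character n-gram vectors, plus a vector for the whole word “<running>”. With n = 3 and the word-boundary markers < and >, the n-grams are: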

"running" → ["<ru", "run", "unn", "nni", "nin", "ing", "ng>"]

This means:

“runninggg” (typo) → still generates a reasonable vector
“preprocessing” (unseen word) → inferred from its n-grams
Morphological relatives (“run”, “runner”, “runners”) share similar representations
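
The extraction step can be sketched in a few lines of plain Python. This illustrates the idea and is not the library's internal code: `char_ngrams` is a hypothetical helper, and it assumes the `<` and `>` boundary markers shown above.

```python
def char_ngrams(word, minn=3, maxn=3):
    """Return the character n-grams of `word`, wrapped in < and > markers."""
    wrapped = f"<{word}>"
    grams = []
    for n in range(minn, maxn + 1):
        for start in range(len(wrapped) - n + 1):
            grams.append(wrapped[start:start + n])
    return grams


trigrams = char_ngrams("running")
print(trigrams)

# A misspelling still shares most of its trigrams with the correct word,
# which is why its vector lands near the vector of the intended word.
shared = set(trigrams) & set(char_ngrams("runninggg"))
print(sorted(shared))
```

Widening the range (for example `minn=3, maxn=6`) yields longer n-grams as well, so related forms such as “runner” and “runners” overlap on stems like “runn”.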

Training Word Embeddings with FastText

import fasttext
import tempfile

# Prepare training data (one document per line)
corpus = """
natural language processing transforms how computers understand text
machine learning models learn patterns from large text corpora
deep learning architectures like transformers process sequential data
bert and gpt are built on transformer self-attention mechanisms
word embeddings capture semantic relationships in continuous vector space
fasttext uses character ngrams for subword representations
"""

with tempfile.NamedTemporaryFile(mode='w', suffix='.txt', delete=False) as f:
    f.write(corpus)
    corpus_file = f.name

# Train Skip-Gram model
model = fasttext.train_unsupervised(
    corpus_file,
    model='skipgram',    # 'skipgram' or 'cbow'
    dim=100,             # embedding dimension
    ws=5,                # window size
    minCount=1,          # minimum word count
    minn=2,              # min char n-gram size
    maxn=6,              # max char n-gram size
    epoch=10
)

# Word vector
vector = model.get_word_vector("transformer")
print(f"Vector dim: {len(vector)}")  # 100

# Subword handling — works for OOV words
oov_vector = model.get_word_vector("transformerssss")
print(f"OOV vector (non-zero): {any(oov_vector != 0)}")  # True
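
Why does the out-of-vocabulary lookup return something useful? Character n-grams are hashed into a fixed-size table of vectors, so any string maps to some rows of that table. The sketch below is a toy model of that mechanism under loud assumptions: the table holds random numbers instead of learned weights, `zlib.crc32` stands in for the library's own hash function, and `oov_vector` and `bucket_table` are hypothetical names.

```python
import random
import zlib

DIM = 4          # toy embedding size (the example above uses 100)
BUCKETS = 1000   # toy table size; real models use a far larger table

rng = random.Random(0)
# One vector per bucket. In a trained model these rows are learned.
bucket_table = [[rng.uniform(-1, 1) for _ in range(DIM)] for _ in range(BUCKETS)]


def char_ngrams(word, minn=2, maxn=6):
    """Character n-grams of the word wrapped in < and > markers."""
    wrapped = f"<{word}>"
    return [wrapped[i:i + n]
            for n in range(minn, maxn + 1)
            for i in range(len(wrapped) - n + 1)]


def oov_vector(word):
    """Average the bucket vectors of all n-grams of an unseen word."""
    grams = char_ngrams(word)
    # Stand-in hash: map each n-gram to a row index in the bucket table.
    rows = [bucket_table[zlib.crc32(g.encode("utf-8")) % BUCKETS] for g in grams]
    return [sum(column) / len(rows) for column in zip(*rows)]


vec = oov_vector("transformerssss")
print(len(vec), any(x != 0 for x in vec))
```

The sketch averages the rows; a plain sum would differ only by a scale factor. Because “transformerssss” shares most of its n-grams with “transformers”, most of the rows that feed the two vectors are the same.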

Text Classification with FastText

FastText is known for fast text classification. Its authors report training on more than a billion words in under ten minutes on a standard multicore CPU:

import fasttext
import tempfile

# FastText format: __label__<label> text
train_data = """__label__tech The GPU ran out of memory during model training
__label__tech The API returned a 500 error after 30 seconds
__label__tech CUDA drivers need to be updated for PyTorch 2.0
__label__food The pasta needs more olive oil and fresh basil
__label__food Sourdough bread requires a 12-hour fermentation period
__label__food Roasting vegetables at 425°F brings out their natural sweetness
__label__finance The Federal Reserve raised rates by 25 basis points
__label__finance Q3 earnings exceeded analyst expectations by 12%
__label__finance The bond yield inverted ahead of the recession
"""

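# Write as UTF-8: the data contains non-ASCII characters (°) and fastText reads UTF-8
with tempfile.NamedTemporaryFile(mode='w', suffix='.txt', delete=False, encoding='utf-8') as f: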
    f.write(train_data)
    train_file = f.name

model = fasttext.train_supervised(
    input=train_file,
    epoch=50,
    lr=0.5,
    wordNgrams=2,      # bigram features
    dim=100,
    loss='softmax'
)

# Predict
tests = [
    "The model accuracy dropped after fine-tuning on new data",
    "Homemade pizza dough should rest for at least one hour",
    "Interest rates remain elevated despite recent inflation data"
]

for text in tests:
    label, confidence = model.predict(text)
    print(f"[{label[0].replace('__label__', ''):<10} {confidence[0]:.3f}] {text}")
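
The training file above was typed by hand. In practice the `__label__` lines are usually generated from labeled pairs. Here is a minimal sketch of that conversion; `to_fasttext_line` is a hypothetical helper, not part of the fasttext package, and it assumes each example must sit on one line with the label as a single whitespace-free token.

```python
def to_fasttext_line(label, text):
    """Format one example as '__label__<label> <text>' on a single line."""
    # Each line is one training example, so fold newlines and runs of
    # whitespace inside the text into single spaces.
    clean_text = " ".join(text.split())
    # The label is read as one whitespace-delimited token, so join
    # multi-word labels with underscores.
    clean_label = "_".join(label.split())
    return f"__label__{clean_label} {clean_text}"


examples = [
    ("tech", "The GPU ran out of memory\nduring model training"),
    ("personal finance", "Rebalance the portfolio once a year"),
]

lines = [to_fasttext_line(label, text) for label, text in examples]
print("\n".join(lines))
```

The resulting lines can be written to a file and passed to `train_supervised` exactly like `train_file` above.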

Evaluating the Classifier

# Save test data
test_data = """__label__tech Backpropagation updates neural network weights iteratively
__label__food Season the chicken with salt pepper and lemon
__label__finance The stock market declined following the Fed announcement
"""

with tempfile.NamedTemporaryFile(mode='w', suffix='.txt', delete=False) as f:
    f.write(test_data)
    test_file = f.name

result = model.test(test_file)
print(f"Samples: {result[0]}")
print(f"Precision@1: {result[1]:.4f}")
print(f"Recall@1:    {result[2]:.4f}")
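
To make the two reported numbers concrete, the following sketch computes precision and recall at k by hand. It is an illustration, not the library's code: `precision_recall_at_k` is a hypothetical helper, and the predictions are made-up toy data rather than output from the model above.

```python
def precision_recall_at_k(gold_labels, predicted_labels, k=1):
    """gold_labels: list of sets; predicted_labels: list of ranked label lists."""
    hits = n_predicted = n_gold = 0
    for gold, ranked in zip(gold_labels, predicted_labels):
        top_k = ranked[:k]
        hits += sum(1 for label in top_k if label in gold)
        n_predicted += len(top_k)   # precision denominator: labels predicted
        n_gold += len(gold)         # recall denominator: true labels
    return hits / n_predicted, hits / n_gold


# Toy data: three examples, the third one is misclassified at rank 1.
gold = [{"tech"}, {"food"}, {"finance"}]
predicted = [["tech", "finance"], ["food", "tech"], ["tech", "finance"]]

p1, r1 = precision_recall_at_k(gold, predicted, k=1)
p2, r2 = precision_recall_at_k(gold, predicted, k=2)
print(f"P@1={p1:.4f} R@1={r1:.4f}")
print(f"P@2={p2:.4f} R@2={r2:.4f}")
```

When every example has exactly one label, precision@1 and recall@1 are both equal to plain accuracy, which is why the two values printed by the evaluation above match on single-label data.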

Loading Pre-Trained FastText Vectors

Meta provides pre-trained FastText vectors for 157 languages:

import fasttext
import fasttext.util

# Download English vectors (large: cc.en.300.bin is about 7 GB once unpacked)
# fasttext.util.download_model('en', if_exists='ignore')

# Load the model
ft = fasttext.load_model('cc.en.300.bin')

# Get vectors
print(ft.get_word_vector("natural").shape)    # (300,)
print(ft.get_word_vector("preprocessing").shape)  # (300,) — even if OOV

# Nearest neighbors
neighbors = ft.get_nearest_neighbors("embedding", k=5)
for score, word in neighbors:
    print(f"{word}: {score:.4f}")
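
The neighbor scores are cosine similarities between word vectors. The sketch below reproduces that ranking on hand-made vectors so it runs without the multi-gigabyte model. The vectors, `cosine`, and `nearest_neighbors` are all illustrative assumptions, not the library's implementation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest_neighbors(query, vectors, k=2):
    """Return (score, word) pairs, highest cosine similarity first."""
    scored = [(cosine(vectors[query], vec), word)
              for word, vec in vectors.items() if word != query]
    return sorted(scored, reverse=True)[:k]


# Hand-made 3-d vectors, purely illustrative.
toy_vectors = {
    "embedding": [1.0, 0.9, 0.0],
    "vector": [0.9, 1.0, 0.1],
    "pasta": [0.0, 0.1, 1.0],
    "representation": [1.0, 0.8, 0.1],
}

toy_neighbors = nearest_neighbors("embedding", toy_vectors)
for score, word in toy_neighbors:
    print(f"{word}: {score:.4f}")
```

The `(score, word)` ordering mirrors what `get_nearest_neighbors` returns, so the same loop works on both.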

Language Identification

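FastText also offers pre-trained language identification models covering 176 languages. They are a separate download from the fastText website, not bundled with the pip package: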

import fasttext

# Load the model (download lid.176.ftz from the fastText website first)
model = fasttext.load_model("lid.176.ftz")  # compressed model, about 917 KB

texts = [
    "Natural language processing is transforming how AI understands text.",
    "Le traitement du langage naturel transforme l'intelligence artificielle.",
    "Verarbeitung natürlicher Sprache revolutioniert die künstliche Intelligenz.",
    "自然言語処理は人工知能を変革しています。",
]

for text in texts:
    predictions = model.predict(text)
    lang = predictions[0][0].replace('__label__', '')
    conf = predictions[1][0]
    print(f"[{lang} {conf:.3f}] {text[:50]}")
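
Short or mixed-language inputs can produce low-confidence guesses, and `predict` expects a single line of text. One possible way to handle both is sketched below. `pick_language` and `single_line` are hypothetical helpers, the 0.5 threshold is an arbitrary assumption, and the inputs are hand-written stand-ins for `model.predict(...)` output rather than real model scores.

```python
def pick_language(labels, probs, min_conf=0.5, fallback="und"):
    """Return the top language code, or `fallback` when confidence is low."""
    if not labels or probs[0] < min_conf:
        return fallback
    return labels[0].replace("__label__", "")


def single_line(text):
    """Fold newlines and repeated whitespace so the text is one line."""
    return " ".join(text.split())


# Stand-ins shaped like the (labels, probabilities) pair used above.
confident = pick_language(("__label__fr",), [0.98])
uncertain = pick_language(("__label__en",), [0.31])
print(confident, uncertain)
print(single_line("first line\nsecond line"))
```

With a loaded model, the call would look like `pick_language(*model.predict(single_line(text)))`.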

FastText vs Word2Vec vs BERT

Feature	FastText	Word2Vec	BERT
Subword handling	Yes (character n-grams)	No	Yes (WordPiece)
OOV words	Yes	No	Yes
Training speed	Very fast	Fast	Slow
Classification	Built-in	No	Via fine-tuning
Context-aware	No	No	Yes
Typical downstream accuracy	Good	Good	Highest

FastText is the best choice when you need fast text classification at scale, support for misspellings and morphologically rich languages, or a lightweight model that doesn’t need a GPU.