FastText
FastText is Meta AI’s text representation library that extends Word2Vec with subword embeddings — it represents words as bags of character n-grams. This means FastText can generate vectors for words it has never seen, handle misspellings, and work well with morphologically rich languages.
Installation
    pip install fasttext

The Key Difference: Subword Embeddings
In Word2Vec, “running” has a single vector. In FastText, “running” is represented as the sum of its character n-gram vectors, with “<” and “>” marking the word boundaries. For n = 3:

    "running" → ["<ru", "run", "unn", "nni", "nin", "ing", "ng>"]

This means:
- “runninggg” (typo) → still generates a reasonable vector
- “preprocessing” (unseen word) → inferred from its n-grams
- Morphological relatives (“run”, “runner”, “runners”) share similar representations
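
The n-gram decomposition above can be reproduced in a few lines of plain Python. This is a minimal sketch for illustration only: `char_ngrams` is a hypothetical helper, not part of the fastText API, and the real library additionally hashes each n-gram into a fixed number of buckets instead of storing the strings.

```python
def char_ngrams(word, minn=3, maxn=3):
    """Return the character n-grams of `word`, wrapped in '<' and '>' boundary markers."""
    wrapped = f"<{word}>"
    grams = []
    for n in range(minn, maxn + 1):
        for start in range(len(wrapped) - n + 1):
            grams.append(wrapped[start:start + n])
    return grams


running = char_ngrams("running")
typo = char_ngrams("runninggg")

# N-grams shared by the correct spelling and the typo
shared = set(running) & set(typo)

print(running)
print(f"{len(shared)} of {len(set(typo))} trigrams in the typo also occur in 'running'")
```

Because most of the typo's trigrams also occur in the correctly spelled word, a vector built from those n-grams ends up close to the vector for “running”.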
Training Word Embeddings with FastText
    import fasttext
    import tempfile

    # Prepare training data (one document per line)
    corpus = """natural language processing transforms how computers understand text
    machine learning models learn patterns from large text corpora
    deep learning architectures like transformers process sequential data
    bert and gpt are built on transformer self-attention mechanisms
    word embeddings capture semantic relationships in continuous vector space
    fasttext uses character ngrams for subword representations"""

    with tempfile.NamedTemporaryFile(mode='w', suffix='.txt', delete=False) as f:
        f.write(corpus)
        corpus_file = f.name

    # Train Skip-Gram model
    model = fasttext.train_unsupervised(
        corpus_file,
        model='skipgram',  # 'skipgram' or 'cbow'
        dim=100,           # embedding dimension
        ws=5,              # window size
        minCount=1,        # minimum word count
        minn=2,            # min char n-gram size
        maxn=6,            # max char n-gram size
        epoch=10
    )

    # Word vector
    vector = model.get_word_vector("transformer")
    print(f"Vector dim: {len(vector)}")  # 100

    # Subword handling: works for OOV words
    oov_vector = model.get_word_vector("transformerssss")
    print(f"OOV vector (non-zero): {any(oov_vector != 0)}")  # True

Text Classification with FastText
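FastText is known for its text classification speed: its authors report training on more than one billion words in under ten minutes on a standard multicore CPU. The example below trains a small three-class classifier: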
    import fasttext
    import os
    import tempfile

    # FastText format: __label__<label> text
    train_data = """__label__tech The GPU ran out of memory during model training
    __label__tech The API returned a 500 error after 30 seconds
    __label__tech CUDA drivers need to be updated for PyTorch 2.0
    __label__food The pasta needs more olive oil and fresh basil
    __label__food Sourdough bread requires a 12-hour fermentation period
    __label__food Roasting vegetables at 425°F brings out their natural sweetness
    __label__finance The Federal Reserve raised rates by 25 basis points
    __label__finance Q3 earnings exceeded analyst expectations by 12%
    __label__finance The bond yield inverted ahead of the recession"""

    with tempfile.NamedTemporaryFile(mode='w', suffix='.txt', delete=False) as f:
        f.write(train_data)
        train_file = f.name

    model = fasttext.train_supervised(
        input=train_file,
        epoch=50,
        lr=0.5,
        wordNgrams=2,  # bigram features
        dim=100,
        loss='softmax'
    )
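
The `wordNgrams=2` setting adds word bigrams as features alongside single words, which recovers some of the local word order that a plain bag of words discards. The sketch below illustrates which features that produces for one sentence. It is a rough illustration under stated assumptions: `word_ngrams` is a hypothetical helper, and fastText itself hashes these n-grams into buckets instead of keeping them as strings.

```python
def word_ngrams(text, max_n=2):
    """Return unigrams plus contiguous word n-grams up to length `max_n`."""
    tokens = text.split()
    features = list(tokens)  # unigrams are always included
    for n in range(2, max_n + 1):
        for start in range(len(tokens) - n + 1):
            features.append(" ".join(tokens[start:start + n]))
    return features


features = word_ngrams("The bond yield inverted", max_n=2)
print(features)
```

With bigrams enabled, “bond yield” becomes a feature in its own right, so the classifier can weight the phrase differently from the two words taken separately.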
    # Predict
    tests = [
        "The model accuracy dropped after fine-tuning on new data",
        "Homemade pizza dough should rest for at least one hour",
        "Interest rates remain elevated despite recent inflation data"
    ]

    for text in tests:
        label, confidence = model.predict(text)
        print(f"[{label[0].replace('__label__', ''):<10} {confidence[0]:.3f}] {text}")

Evaluating the Classifier
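
fastText's `test` method reports precision and recall at k, with k = 1 by default. As a rough guide to what those numbers mean, the sketch below computes micro-averaged precision@k and recall@k by hand. The function `precision_recall_at_k` and the toy labels are hypothetical and exist only for illustration; they are not part of the fastText API.

```python
def precision_recall_at_k(gold, ranked, k=1):
    """Compute micro-averaged precision@k and recall@k.

    gold:   list of sets of true labels, one set per example
    ranked: list of label lists, ordered from most to least likely
    """
    hits = n_predicted = n_gold = 0
    for true_labels, predictions in zip(gold, ranked):
        top_k = predictions[:k]
        hits += sum(1 for label in top_k if label in true_labels)
        n_predicted += len(top_k)
        n_gold += len(true_labels)
    return hits / n_predicted, hits / n_gold


# Made-up gold labels and ranked predictions for three examples
gold = [{"tech"}, {"food"}, {"finance"}]
ranked = [["tech", "finance"], ["food", "tech"], ["tech", "finance"]]

p1, r1 = precision_recall_at_k(gold, ranked, k=1)
p2, r2 = precision_recall_at_k(gold, ranked, k=2)
print(f"P@1={p1:.3f} R@1={r1:.3f} P@2={p2:.3f} R@2={r2:.3f}")
```

When every example has exactly one true label and k = 1, precision and recall are equal and both match plain accuracy.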
    # Save test data
    test_data = """__label__tech Backpropagation updates neural network weights iteratively
    __label__food Season the chicken with salt pepper and lemon
    __label__finance The stock market declined following the Fed announcement"""

    with tempfile.NamedTemporaryFile(mode='w', suffix='.txt', delete=False) as f:
        f.write(test_data)
        test_file = f.name

    result = model.test(test_file)
    print(f"Samples: {result[0]}")
    print(f"Precision@1: {result[1]:.4f}")
    print(f"Recall@1: {result[2]:.4f}")

Loading Pre-Trained FastText Vectors
Meta provides pre-trained FastText vectors for 157 languages:
    import fasttext
    import fasttext.util

    # Download English vectors (a multi-gigabyte download; about 7 GB once decompressed)
    # fasttext.util.download_model('en', if_exists='ignore')

    # Load the model
    ft = fasttext.load_model('cc.en.300.bin')

    # Get vectors
    print(ft.get_word_vector("natural").shape)        # (300,)
    print(ft.get_word_vector("preprocessing").shape)  # (300,), even if OOV
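
Word vectors like these are usually compared with cosine similarity, which measures the angle between two vectors and ignores their length. The sketch below shows the computation on toy three-dimensional vectors. The values are made up for illustration and `cosine_similarity` is a hypothetical helper; real fastText vectors have 300 dimensions.

```python
import math


def cosine_similarity(u, v):
    """Return the cosine of the angle between vectors u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy vectors standing in for real embeddings (made-up values)
natural = [1.0, 2.0, 0.0]
organic = [2.0, 4.0, 0.0]  # same direction as `natural`, different length
finance = [0.0, 0.0, 3.0]  # orthogonal to `natural`

sim_close = cosine_similarity(natural, organic)
sim_far = cosine_similarity(natural, finance)
print(f"natural vs organic: {sim_close:.3f}")
print(f"natural vs finance: {sim_far:.3f}")
```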
    # Nearest neighbors
    neighbors = ft.get_nearest_neighbors("embedding", k=5)
    for score, word in neighbors:
        print(f"{word}: {score:.4f}")

Language Identification
Meta also publishes pre-trained FastText language identification models that recognize 176 languages. They are downloaded separately from the library:
    import fasttext

    # Load the language identification model (download lid.176.ftz first)
    model = fasttext.load_model("lid.176.ftz")  # Compact 917KB model

    texts = [
        "Natural language processing is transforming how AI understands text.",
        "Le traitement du langage naturel transforme l'intelligence artificielle.",
        "Verarbeitung natürlicher Sprache revolutioniert die künstliche Intelligenz.",
        "自然言語処理は人工知能を変革しています。",
    ]

    for text in texts:
        predictions = model.predict(text)
        lang = predictions[0][0].replace('__label__', '')
        conf = predictions[1][0]
        print(f"[{lang} {conf:.3f}] {text[:50]}")

FastText vs Word2Vec vs BERT
| Feature | FastText | Word2Vec | BERT |
|---|---|---|---|
| Subword handling | Yes (character n-grams) | No | Yes (WordPiece) |
| OOV words | Yes | No | Yes |
| Training speed | Very fast | Fast | Slow |
| Classification | Built-in | No | Via fine-tuning |
| Context-aware | No | No | Yes |
| Accuracy on downstream tasks | Good | Good | Usually highest |
FastText is the best choice when you need fast text classification at scale, support for misspellings and morphologically rich languages, or a lightweight model that doesn’t need a GPU.