Text Vectorization in NLP
Text vectorization converts raw text into numeric vectors that machine learning models can process. No model works directly on strings — every NLP pipeline eventually converts words, sentences, or documents into numbers. The method you choose determines what information is preserved and what is lost.
The Vectorization Spectrum
Methods range from simple to semantically rich:
    One-Hot → BoW → TF-IDF → Word2Vec → Sentence Embeddings → Transformer Encodings
    sparse                                                                    dense
    no semantics                                                     rich semantics
    fast                                                                     slower

1. One-Hot Encoding
Assigns each word a unique integer, then creates a binary vector with a 1 in that position:
    from sklearn.preprocessing import LabelEncoder
    import numpy as np

    vocabulary = ["cat", "dog", "fish", "bird"]
    encoder = LabelEncoder()
    encoder.fit(vocabulary)  # classes are stored in sorted order: bird, cat, dog, fish

    def one_hot(word, vocab_size):
        idx = encoder.transform([word])[0]
        vec = np.zeros(vocab_size)
        vec[idx] = 1
        return vec

    print(one_hot("cat", 4))   # [0. 1. 0. 0.]
    print(one_hot("bird", 4))  # [1. 0. 0. 0.]

Problems: high dimensionality (one dimension per vocabulary word), no semantic relationships between words, and poor scaling to large vocabularies.
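The "no semantic relationships" problem can be made concrete: any two distinct one-hot vectors have a dot product of 0 and the same Euclidean distance, so "cat" is exactly as far from "dog" as from "fish". A minimal sketch in plain Python, assuming a hypothetical four-word vocabulary indexed in list order:

```python
import math
from itertools import combinations

# Hypothetical toy vocabulary; the index is simply the list position.
vocabulary = ["cat", "dog", "fish", "bird"]
index = {word: i for i, word in enumerate(vocabulary)}

def one_hot(word):
    """Return a one-hot list with a 1 at the word's vocabulary index."""
    vec = [0] * len(vocabulary)
    vec[index[word]] = 1
    return vec

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Collect (dot product, distance) for every pair of distinct words.
pair_stats = {
    (w1, w2): (dot(one_hot(w1), one_hot(w2)), euclidean(one_hot(w1), one_hot(w2)))
    for w1, w2 in combinations(vocabulary, 2)
}
for pair, (similarity, distance) in pair_stats.items():
    print(pair, similarity, round(distance, 3))
```

Every pair reports a similarity of 0 and a distance of √2, which is why one-hot vectors cannot express that some words are closer in meaning than others.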
2. Bag of Words (Count Vectorizer)
    from sklearn.feature_extraction.text import CountVectorizer

    corpus = [
        "Language models generate coherent text.",
        "Generative models learn text distributions.",
        "Transformers revolutionized language generation.",
    ]

    vectorizer = CountVectorizer()
    X = vectorizer.fit_transform(corpus)

    print("Vocabulary:", vectorizer.get_feature_names_out())
    print("Sparse matrix:\n", X.toarray())

3. TF-IDF Vectorization
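It may help to see the weighting itself before relying on the library. The sketch below computes TF-IDF by hand for a tiny pre-tokenized corpus. It assumes the smoothed formula idf(t) = ln((1 + n) / (1 + df(t))) + 1 followed by L2 normalization, which is what scikit-learn documents as the `TfidfVectorizer` default. The corpus and the helper names `idf` and `tfidf` are illustrative only.

```python
import math
from collections import Counter

# Hypothetical pre-tokenized toy corpus.
docs = [
    ["language", "models", "generate", "text"],
    ["generative", "models", "learn", "text"],
    ["transformers", "changed", "language", "generation"],
]

n_docs = len(docs)
# Document frequency: in how many documents does each term appear?
df = Counter(term for doc in docs for term in set(doc))

def idf(term):
    """Smoothed inverse document frequency (assumed formula, see lead-in)."""
    return math.log((1 + n_docs) / (1 + df[term])) + 1

def tfidf(doc):
    """Return an L2-normalized {term: weight} mapping for one document."""
    counts = Counter(doc)
    raw = {term: tf * idf(term) for term, tf in counts.items()}
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {term: w / norm for term, w in raw.items()}

weights = tfidf(docs[0])
for term, weight in sorted(weights.items(), key=lambda kv: -kv[1]):
    print(f"{term:10s} {weight:.3f}")
```

In the first document, "generate" appears in no other document and so receives the largest weight, while "models" and "text", which are shared with the second document, are weighted lower.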
    from sklearn.feature_extraction.text import TfidfVectorizer

    vectorizer = TfidfVectorizer(
        stop_words='english',
        ngram_range=(1, 2),
        max_features=500,
        sublinear_tf=True,  # use 1 + log(tf) so repeated terms in a document count less
    )

    # corpus is the list of documents defined in the Bag of Words example
    X = vectorizer.fit_transform(corpus)
    print(f"Shape: {X.shape}")  # (3, n_features), with n_features capped at 500

4. Word Embedding Averaging
Average the embeddings of all words in a sentence to get a single document vector:
    import re

    import numpy as np
    import gensim.downloader as api

    # Load pre-trained Word2Vec (large download on first use)
    w2v = api.load("word2vec-google-news-300")

    def average_word_vectors(sentence, model, dim=300):
        # Tokenize on letters so trailing punctuation does not hide words from the lookup
        tokens = re.findall(r"[a-z]+", sentence.lower())
        vectors = [model[w] for w in tokens if w in model]
        if not vectors:
            return np.zeros(dim)
        return np.mean(vectors, axis=0)

    sentences = [
        "Natural language processing transforms text understanding.",
        "Deep learning models learn from large data.",
    ]

    embeddings = [average_word_vectors(s, w2v) for s in sentences]
    print(f"Vector shape: {embeddings[0].shape}")  # (300,)

5. Sentence-Transformer Encoding
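One motivation for using a sentence encoder rather than averaged word vectors is that a mean is unchanged by reordering, so two sentences containing the same words in a different order receive the same vector. A small sketch with made-up 3-dimensional vectors (hypothetical values, not real Word2Vec embeddings):

```python
# Made-up 3-d "embeddings" for illustration only.
toy_vectors = {
    "dog": [0.9, 0.1, 0.0],
    "bites": [0.2, 0.8, 0.1],
    "man": [0.1, 0.2, 0.9],
}

def average_vector(sentence):
    """Average the toy vectors of the known tokens in a sentence."""
    tokens = [t for t in sentence.lower().split() if t in toy_vectors]
    dim = len(next(iter(toy_vectors.values())))
    if not tokens:
        return [0.0] * dim
    return [sum(toy_vectors[t][i] for t in tokens) / len(tokens) for i in range(dim)]

a = average_vector("dog bites man")
b = average_vector("man bites dog")

# Same bag of words, opposite meaning, identical averaged vector
# (compared with a tolerance to allow for floating-point rounding).
same = all(abs(x - y) < 1e-12 for x, y in zip(a, b))
print(same)
```

Encoders that read the tokens in sequence can, in principle, tell these two sentences apart; an average cannot.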
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer('all-MiniLM-L6-v2')

    texts = [
        "How do I fine-tune a BERT model?",
        "The stock market declined following the Fed announcement.",
        "Cosine similarity measures the angle between two vectors.",
    ]

    embeddings = model.encode(texts, batch_size=32, show_progress_bar=True)
    print(f"Embeddings shape: {embeddings.shape}")  # (3, 384)

6. Transformer Token Encoding (BERT)
For token-level tasks (NER, POS, span extraction), you need token-level representations:
    from transformers import AutoTokenizer, AutoModel
    import torch

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = AutoModel.from_pretrained("bert-base-uncased")

    text = "The new AI model achieved state-of-the-art results."
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)

    with torch.no_grad():
        outputs = model(**inputs)

    # Shape: (batch, sequence_length, hidden_dim)
    token_embeddings = outputs.last_hidden_state
    cls_embedding = token_embeddings[:, 0, :]  # [CLS] as sentence representation

    # sequence_length counts [CLS], [SEP], and every wordpiece, including each hyphen
    print(f"Token embeddings shape: {token_embeddings.shape}")  # (1, sequence_length, 768)
    print(f"CLS embedding shape: {cls_embedding.shape}")  # (1, 768)

Choosing the Right Method
| Method | Memory | Speed | Semantic Quality | Task Fit |
|---|---|---|---|---|
| One-hot | Huge (sparse) | Fast | None | Toy examples |
| BoW | Large (sparse) | Fast | None | Topic modeling, baseline |
| TF-IDF | Large (sparse) | Fast | Low | Search, keyword matching |
| Avg. Word2Vec | Small (dense) | Fast | Medium | Text classification |
| Sentence-BERT | Small (dense) | Medium | High | Semantic search, RAG |
| BERT tokens | Small (dense) | Slow | Highest | NER, QA, fine-tuning |
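For the semantic-search use case in the table, retrieval usually comes down to ranking documents by cosine similarity between a query embedding and each document embedding. A minimal sketch, with made-up 3-dimensional vectors standing in for real encoder output (the labels and all numbers are hypothetical):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention for zero vectors, chosen for this sketch
    return dot / (norm_a * norm_b)

# Hypothetical document embeddings (real ones would come from an encoder).
doc_embeddings = {
    "fine-tuning BERT": [0.9, 0.1, 0.2],
    "stock market news": [0.1, 0.9, 0.1],
    "cosine similarity": [0.3, 0.2, 0.9],
}
query_embedding = [0.8, 0.2, 0.3]  # hypothetical query about model training

# Rank document labels from most to least similar to the query.
ranked = sorted(
    doc_embeddings,
    key=lambda name: cosine_similarity(query_embedding, doc_embeddings[name]),
    reverse=True,
)
print(ranked)
```

With real embeddings the same ranking step applies unchanged; only the vectors and their dimensionality differ.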
Full Vectorization Pipeline
    from sklearn.pipeline import Pipeline
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.decomposition import TruncatedSVD
    from sklearn.linear_model import LogisticRegression

    # TF-IDF + dimensionality reduction + classifier
    pipeline = Pipeline([
        ('tfidf', TfidfVectorizer(stop_words='english', ngram_range=(1, 2), max_features=10000)),
        ('svd', TruncatedSVD(n_components=100)),  # LSA: latent semantic analysis
        ('clf', LogisticRegression(max_iter=300)),
    ])

    # Fit and evaluate
    # pipeline.fit(X_train, y_train)
    # pipeline.score(X_test, y_test)

For new projects in 2025, start with a TF-IDF baseline, then try sentence-transformer embeddings. Only add the complexity of fine-tuned transformers if the baseline doesn’t meet your accuracy target.
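To make the commented-out `fit` and `score` calls concrete, here is a sketch of the baseline on a tiny invented dataset. The texts, the labels, and the small `n_components=2` are assumptions chosen so the example runs on a handful of documents; real data would use larger settings and a held-out test split.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Invented toy data for illustration only.
train_texts = [
    "the team won the football match",
    "a late goal decided the game",
    "the striker scored twice in the final",
    "the coach praised the defense",
    "the new laptop has a faster processor",
    "the phone update improves battery life",
    "developers released a software patch",
    "the chip maker announced a new processor",
]
train_labels = ["sports"] * 4 + ["tech"] * 4

toy_pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(stop_words="english")),
    # n_components should be smaller than the number of TF-IDF features.
    ("svd", TruncatedSVD(n_components=2, random_state=0)),
    ("clf", LogisticRegression(max_iter=300)),
])
toy_pipeline.fit(train_texts, train_labels)

# Predict labels for two unseen sentences.
predictions = toy_pipeline.predict([
    "the goalkeeper saved the match",
    "a faster processor for the new phone",
])
print(predictions)
```

On data this small the predictions are not meaningful; the point is only the call sequence, which stays the same when the TF-IDF step is later swapped for precomputed sentence embeddings.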