Text Vectorization in NLP

Text vectorization converts raw text into numeric vectors that machine learning models can process. No model works directly on strings — every NLP pipeline eventually converts words, sentences, or documents into numbers. The method you choose determines what information is preserved and what is lost.

The Vectorization Spectrum

Methods range from simple to semantically rich:

One-Hot → BoW → TF-IDF → Word2Vec → Sentence Embeddings → Transformer Encodings
  sparse                                                          dense
  no semantics                                          rich semantics
  fast                                                         slower

1. One-Hot Encoding

Assigns each word a unique integer, then creates a binary vector with a 1 in that position:

from sklearn.preprocessing import LabelEncoder
import numpy as np

vocabulary = ["cat", "dog", "fish", "bird"]
encoder = LabelEncoder()
encoder.fit(vocabulary)

def one_hot(word, vocab_size):
    idx = encoder.transform([word])[0]
    vec = np.zeros(vocab_size)
    vec[idx] = 1
    return vec

# LabelEncoder sorts its classes alphabetically: bird=0, cat=1, dog=2, fish=3
print(one_hot("cat",  4))   # [0. 1. 0. 0.]
print(one_hot("bird", 4))   # [1. 0. 0. 0.]

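Problems: the vector length equals the vocabulary size, so dimensionality grows with every new word; all distinct words are orthogonal, so no semantic relationship is captured; and the approach doesn’t scale to real vocabularies of tens or hundreds of thousands of words.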
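The lack of semantic relationships can be checked directly. The following is a minimal sketch using only the standard library; the vocabulary order and the helper names `one_hot` and `dot` are illustrative choices, not part of the scikit-learn example above.

```python
# Toy vocabulary; the index assignment is an arbitrary illustrative choice.
vocab = ["bird", "cat", "dog", "fish"]
index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    """Return a list of len(vocab) zeros with a single 1 at the word's index."""
    vec = [0] * len(vocab)
    vec[index[word]] = 1
    return vec

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

# Every pair of distinct words has dot product 0, so "cat" is no closer
# to "dog" than it is to "fish".
cat_dog = dot(one_hot("cat"), one_hot("dog"))
cat_fish = dot(one_hot("cat"), one_hot("fish"))
cat_cat = dot(one_hot("cat"), one_hot("cat"))
print(cat_dog, cat_fish, cat_cat)  # 0 0 1
```

Because every off-diagonal similarity is zero, a model consuming these vectors has to learn any relationship between words from scratch.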

2. Bag of Words (Count Vectorizer)

from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "Language models generate coherent text.",
    "Generative models learn text distributions.",
    "Transformers revolutionized language generation."
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)

print("Vocabulary:", vectorizer.get_feature_names_out())
print("Count matrix:\n", X.toarray())   # dense view of the sparse count matrix
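What a bag-of-words vector keeps and discards is easier to see in a hand-rolled version. This is a rough sketch, assuming plain whitespace tokenization (CountVectorizer uses its own token pattern); the helper `bag_of_words` and the three toy sentences are made up for illustration.

```python
from collections import Counter

def bag_of_words(text, vocabulary):
    """Count how often each vocabulary word occurs in a lowercased, whitespace-split text."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]

docs = ["dog bites man", "man bites dog", "dog bites dog"]
vocabulary = sorted({w for d in docs for w in d.lower().split()})

vectors = [bag_of_words(d, vocabulary) for d in docs]
print(vocabulary)  # ['bites', 'dog', 'man']
for d, v in zip(docs, vectors):
    print(d, "->", v)
# dog bites man -> [1, 1, 1]
# man bites dog -> [1, 1, 1]
# dog bites dog -> [1, 2, 0]
```

The first two sentences mean different things but map to the same vector: counts are preserved, word order is not.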

3. TF-IDF Vectorization

from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(
    stop_words='english',
    ngram_range=(1, 2),
    max_features=500,
    sublinear_tf=True   # replace TF with 1 + log(TF) to dampen terms repeated many times in one document
)

X = vectorizer.fit_transform(corpus)
print(f"Shape: {X.shape}")  # (3, n_features), where n_features <= 500
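To make the weighting concrete, here is a small standard-library sketch of the arithmetic. It assumes the smoothed IDF that scikit-learn documents as its default, idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by L2 normalization of each row. The function `tfidf`, its whitespace tokenizer, and the toy documents are illustrative assumptions, not the library's implementation.

```python
import math
from collections import Counter

def tfidf(docs, sublinear_tf=False):
    """Return (vocab, idf, rows) with L2-normalized TF-IDF rows.

    Uses the smoothed IDF: idf(t) = ln((1 + n) / (1 + df(t))) + 1.
    """
    tokenized = [d.lower().split() for d in docs]
    vocab = sorted({t for doc in tokenized for t in doc})
    n = len(docs)
    # Document frequency: number of documents containing each term
    df = Counter(t for doc in tokenized for t in set(doc))
    idf = {t: math.log((1 + n) / (1 + df[t])) + 1 for t in vocab}

    rows = []
    for doc in tokenized:
        counts = Counter(doc)
        row = []
        for t in vocab:
            tf = counts[t]
            if sublinear_tf and tf > 0:
                tf = 1 + math.log(tf)   # dampen repeated terms
            row.append(tf * idf[t])
        norm = math.sqrt(sum(x * x for x in row)) or 1.0
        rows.append([x / norm for x in row])
    return vocab, idf, rows

docs = [
    "models generate text",
    "generative models learn text",
    "transformers generate text",
]
vocab, idf, rows = tfidf(docs)
# "text" is in all 3 documents, "models" in 2, "learn" in 1
print(round(idf["text"], 3), round(idf["models"], 3), round(idf["learn"], 3))
# 1.0 1.288 1.693
```

A term that appears in every document gets the minimum weight of 1.0, while rarer terms are weighted up, which is why TF-IDF tends to surface distinctive keywords.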

4. Word Embedding Averaging

Average the embeddings of all words in a sentence to get a single document vector:

import numpy as np
import gensim.downloader as api

# Load pre-trained Word2Vec (Google News, 300 dimensions); the first call downloads a large file
w2v = api.load("word2vec-google-news-300")

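def average_word_vectors(sentence, model, dim=300):
    # Strip surrounding punctuation so "understanding." matches "understanding"
    tokens = [t.strip(".,!?;:\"'()") for t in sentence.lower().split()]
    # Skip out-of-vocabulary tokens
    vectors = [model[w] for w in tokens if w in model]
    if not vectors:
        return np.zeros(dim)   # no known words: fall back to a zero vector
    return np.mean(vectors, axis=0)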

sentences = [
    "Natural language processing transforms text understanding.",
    "Deep learning models learn from large data."
]

embeddings = [average_word_vectors(s, w2v) for s in sentences]
print(f"Vector shape: {embeddings[0].shape}")  # (300,)
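The behavior of averaging can be tried without downloading a model. The sketch below swaps the real Word2Vec vectors for a hypothetical 3-dimensional lookup table (the numbers are invented) and uses a made-up helper `average_vector`; it illustrates out-of-vocabulary handling and the fact that averaging ignores word order.

```python
import math

# Hypothetical 3-dimensional embeddings, invented for illustration only.
toy_embeddings = {
    "cat": [0.9, 0.1, 0.0],
    "dog": [0.8, 0.2, 0.0],
    "stock": [0.0, 0.1, 0.9],
}

def average_vector(sentence, table, dim=3):
    """Average the vectors of known tokens; return zeros if none are known."""
    vectors = [table[t] for t in sentence.lower().split() if t in table]
    if not vectors:
        return [0.0] * dim
    return [sum(col) / len(vectors) for col in zip(*vectors)]

forward = average_vector("the cat chased the dog", toy_embeddings)
backward = average_vector("the dog chased the cat", toy_embeddings)
unknown = average_vector("zzz qqq", toy_embeddings)

# Unknown words ("the", "chased") are skipped, and swapping subject and
# object leaves the average unchanged.
same = all(math.isclose(a, b) for a, b in zip(forward, backward))
print(same)     # True
print(unknown)  # [0.0, 0.0, 0.0]
```

This order-insensitivity is the main limitation that sentence-level encoders address.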

5. Sentence-Transformer Encoding

from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')

texts = [
    "How do I fine-tune a BERT model?",
    "The stock market declined following the Fed announcement.",
    "Cosine similarity measures the angle between two vectors."
]

embeddings = model.encode(texts, batch_size=32, show_progress_bar=True)
print(f"Embeddings shape: {embeddings.shape}")  # (3, 384)
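Sentence embeddings are typically compared with cosine similarity, for example to rank documents against a query. The following is a minimal sketch of that ranking step in plain Python; the 4-dimensional vectors are hypothetical stand-ins for the 384-dimensional output of model.encode, and `rank_by_similarity` is an illustrative helper, not a sentence-transformers function.

```python
import math

def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)

def rank_by_similarity(query_vec, doc_vecs):
    """Return (doc_index, score) pairs sorted from most to least similar."""
    scores = [(i, cosine(query_vec, v)) for i, v in enumerate(doc_vecs)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

# Hypothetical vectors, one per document (values invented for illustration)
doc_vecs = [
    [0.9, 0.1, 0.0, 0.1],   # stands in for the BERT fine-tuning question
    [0.0, 0.9, 0.1, 0.0],   # stands in for the stock market sentence
    [0.1, 0.0, 0.9, 0.2],   # stands in for the cosine similarity sentence
]
query_vec = [0.8, 0.0, 0.1, 0.2]  # stands in for an encoded query about fine-tuning

ranking = rank_by_similarity(query_vec, doc_vecs)
print([i for i, _ in ranking])  # [0, 2, 1]
```

With real embeddings the same ranking logic applies; only the vectors change.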

6. Transformer Token Encoding (BERT)

For token-level tasks (NER, POS tagging, span extraction), you need one vector per token rather than one per sentence:

from transformers import AutoTokenizer, AutoModel
import torch

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

text = "The new AI model achieved state-of-the-art results."
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)

with torch.no_grad():
    outputs = model(**inputs)

# Shape: (batch, sequence_length, hidden_dim)
token_embeddings = outputs.last_hidden_state
cls_embedding = token_embeddings[:, 0, :]  # [CLS] as sentence representation

print(f"Token embeddings shape: {token_embeddings.shape}")  # (1, seq_len, 768); seq_len counts WordPiece tokens plus [CLS] and [SEP]
print(f"CLS embedding shape: {cls_embedding.shape}")         # (1, 768)
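The [CLS] vector is one way to get a single vector for the whole input. A commonly used alternative is mean pooling over the token vectors while ignoring padding positions. The sketch below shows that idea on plain lists standing in for one sequence of outputs.last_hidden_state and the matching row of inputs["attention_mask"]; the helper `masked_mean_pool` and the 2-dimensional values are hypothetical.

```python
def masked_mean_pool(token_vectors, attention_mask):
    """Average only the token vectors whose mask value is 1."""
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    if not kept:
        raise ValueError("attention_mask selects no tokens")
    return [sum(col) / len(kept) for col in zip(*kept)]

# Hypothetical 2-d hidden states for one sequence of 4 positions;
# the last position is padding and must not influence the result.
token_vectors = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [100.0, 100.0]]
attention_mask = [1, 1, 1, 0]

sentence_vector = masked_mean_pool(token_vectors, attention_mask)
print(sentence_vector)  # [3.0, 4.0]
```

Masking matters mostly for batched inputs, where shorter sequences are padded to the length of the longest one.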

Choosing the Right Method

Method	Memory	Speed	Semantic Quality	Task Fit
One-hot	Huge (sparse)	Fast	None	Toy examples
BoW	Large (sparse)	Fast	None	Topic modeling, baseline
TF-IDF	Large (sparse)	Fast	Low	Search, keyword matching
Avg. Word2Vec	Small (dense)	Fast	Medium	Text classification
Sentence-BERT	Small (dense)	Medium	High	Semantic search, RAG
BERT tokens	Larger (one dense vector per token)	Slow	Highest	NER, QA, fine-tuning

Full Vectorization Pipeline

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.linear_model import LogisticRegression

# TF-IDF + dimensionality reduction + classifier
pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(stop_words='english', ngram_range=(1, 2), max_features=10000)),
    ('svd', TruncatedSVD(n_components=100)),   # LSA: latent semantic analysis
    ('clf', LogisticRegression(max_iter=300))
])

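# Fit and evaluate (X_train / X_test are lists of raw text strings; y_train / y_test are their labels)
# pipeline.fit(X_train, y_train)
# pipeline.score(X_test, y_test)   # mean accuracy on the test set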

For new projects in 2025, start with a TF-IDF baseline, then try sentence-transformer embeddings. Only add the complexity of fine-tuned transformers if the baseline doesn’t meet your accuracy target.