Text Vectorization in NLP

Text vectorization converts raw text into numeric vectors that machine learning models can process. No model works directly on strings — every NLP pipeline eventually converts words, sentences, or documents into numbers. The method you choose determines what information is preserved and what is lost.


The Vectorization Spectrum

Methods range from simple to semantically rich:

One-Hot → BoW → TF-IDF → Word2Vec → Sentence Embeddings → Transformer Encodings
sparse         →  dense
no semantics   →  rich semantics
fast           →  slower

1. One-Hot Encoding

Assigns each word a unique integer, then creates a binary vector with a 1 in that position:

import numpy as np
from sklearn.preprocessing import LabelEncoder

vocabulary = ["cat", "dog", "fish", "bird"]
encoder = LabelEncoder()
encoder.fit(vocabulary)  # classes_ are sorted: ['bird', 'cat', 'dog', 'fish']

def one_hot(word, vocab_size):
    idx = encoder.transform([word])[0]  # integer index of the word
    vec = np.zeros(vocab_size)
    vec[idx] = 1
    return vec

# Indices follow the sorted class order, not the order of `vocabulary`
print(one_hot("cat", 4))   # [0. 1. 0. 0.]
print(one_hot("bird", 4))  # [1. 0. 0. 0.]

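Problems: the vector length equals the vocabulary size (high dimensionality), every pair of distinct words is equally dissimilar (no semantic relationships), and the approach doesn’t scale to realistic vocabularies.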
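To make the "no semantic relationships" problem concrete, here is a minimal sketch that compares one-hot vectors with cosine similarity. The `index` dictionary and the `cosine` helper are illustrative names introduced here, not part of the example above.

```python
import numpy as np

vocabulary = ["bird", "cat", "dog", "fish"]
index = {word: i for i, word in enumerate(vocabulary)}  # hypothetical lookup table

def one_hot(word):
    vec = np.zeros(len(vocabulary))
    vec[index[word]] = 1.0
    return vec

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Distinct one-hot vectors are orthogonal, so every pair scores 0
sim_cat_dog = cosine(one_hot("cat"), one_hot("dog"))
sim_cat_fish = cosine(one_hot("cat"), one_hot("fish"))
print(sim_cat_dog, sim_cat_fish)  # 0.0 0.0
```

Because every pair of different words gets the same score, the encoding cannot express that "cat" is closer to "dog" than to "fish".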


2. Bag of Words (Count Vectorizer)

from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "Language models generate coherent text.",
    "Generative models learn text distributions.",
    "Transformers revolutionized language generation."
]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)  # scipy sparse matrix, one row per document
print("Vocabulary:", vectorizer.get_feature_names_out())
print("Count matrix (dense view):\n", X.toarray())
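A bag of words keeps term counts and discards word order. The sketch below shows the idea in plain Python; the regex tokenizer and the `bag_of_words` helper are simplifications assumed for illustration and do not reproduce `CountVectorizer`'s exact tokenization.

```python
import re
from collections import Counter

def bag_of_words(text, vocabulary):
    # Lowercase and keep alphabetic runs only (a simplified tokenizer)
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]

vocabulary = ["bites", "dog", "man"]
a = bag_of_words("Dog bites man.", vocabulary)
b = bag_of_words("Man bites dog.", vocabulary)
print(a, b, a == b)  # [1, 1, 1] [1, 1, 1] True
```

Two sentences with opposite meanings map to the same vector, which is the main information a count-based representation loses.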

3. TF-IDF Vectorization

from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(
    stop_words='english',
    ngram_range=(1, 2),   # unigrams and bigrams
    max_features=500,     # keep at most the 500 most frequent terms
    sublinear_tf=True     # replace tf with 1 + log(tf) to dampen terms repeated often within a document
)
X = vectorizer.fit_transform(corpus)  # corpus from the Bag of Words example
print(f"Shape: {X.shape}")  # (3, n_features), with n_features <= 500
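The weighting behind TF-IDF can be worked through by hand. The sketch below assumes the smoothed IDF formula documented for scikit-learn's defaults, idf = ln((1 + n) / (1 + df)) + 1, followed by L2 normalisation; the pre-tokenised `docs` are a toy stand-in for the corpus and the `tfidf` helper is a name introduced here.

```python
import math

# Toy pre-tokenised documents (unigrams only), assumed for illustration
docs = [
    ["language", "models", "generate", "text"],
    ["generative", "models", "learn", "text"],
    ["transformers", "language", "generation"],
]
n_docs = len(docs)
vocab = sorted({t for d in docs for t in d})
df = {t: sum(t in d for d in docs) for t in vocab}           # document frequency
idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}

def tfidf(doc, sublinear=True):
    weights = {}
    for term in set(doc):
        tf = doc.count(term)
        if sublinear:
            tf = 1 + math.log(tf)                            # sublinear term frequency
        weights[term] = tf * idf[term]
    norm = math.sqrt(sum(w * w for w in weights.values()))  # L2 normalisation
    return {t: w / norm for t, w in weights.items()}

vec = tfidf(docs[0])
print(round(idf["generate"], 3), round(idf["models"], 3))  # 1.693 1.288
```

"generate" appears in one document and "models" in two, so the rarer term receives the larger weight within the first document.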

4. Word Embedding Averaging

Average the embeddings of all words in a sentence to get a single document vector:

import string

import numpy as np
import gensim.downloader as api

# Load pre-trained Word2Vec (large download on first use)
w2v = api.load("word2vec-google-news-300")

def average_word_vectors(sentence, model, dim=300):
    # Strip punctuation so that "understanding." matches "understanding"
    cleaned = sentence.lower().translate(str.maketrans("", "", string.punctuation))
    tokens = cleaned.split()
    vectors = [model[w] for w in tokens if w in model]  # skip out-of-vocabulary words
    if not vectors:
        return np.zeros(dim)  # no known words: fall back to a zero vector
    return np.mean(vectors, axis=0)

sentences = [
    "Natural language processing transforms text understanding.",
    "Deep learning models learn from large data."
]
embeddings = [average_word_vectors(s, w2v) for s in sentences]
print(f"Vector shape: {embeddings[0].shape}")  # (300,)
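Averaging is cheap, but it has two properties worth seeing in isolation: the result does not depend on word order, and a sentence with no known words collapses to a zero vector. The sketch below uses a made-up three-dimensional embedding table (`toy_embeddings`) so it runs without downloading a model; the values are hypothetical.

```python
import numpy as np

# Hypothetical 3-dimensional embeddings, invented for illustration only
toy_embeddings = {
    "model":    np.array([0.9, 0.1, 0.0]),
    "beats":    np.array([0.1, 0.8, 0.2]),
    "baseline": np.array([0.7, 0.0, 0.5]),
}

def average_vectors(sentence, table, dim=3):
    vectors = [table[w] for w in sentence.lower().split() if w in table]
    if not vectors:
        return np.zeros(dim)
    return np.mean(vectors, axis=0)

v1 = average_vectors("model beats baseline", toy_embeddings)
v2 = average_vectors("baseline beats model", toy_embeddings)
v3 = average_vectors("unknown words only", toy_embeddings)

print(np.allclose(v1, v2))  # True: word order is lost
print(v3)                   # [0. 0. 0.]
```

If many inputs end up as zero vectors, that usually points to a tokenisation or vocabulary mismatch rather than to genuinely empty text.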

5. Sentence-Transformer Encoding

from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')
texts = [
    "How do I fine-tune a BERT model?",
    "The stock market declined following the Fed announcement.",
    "Cosine similarity measures the angle between two vectors."
]
embeddings = model.encode(texts, batch_size=32, show_progress_bar=True)
print(f"Embeddings shape: {embeddings.shape}")  # (3, 384)
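Sentence embeddings are typically compared with cosine similarity. The sketch below ranks documents against a query using NumPy only; `rank_by_cosine` is a helper name introduced here, and the small vectors stand in for what `model.encode(...)` would return.

```python
import numpy as np

def rank_by_cosine(query_vec, doc_vecs):
    """Return document indices sorted by descending cosine similarity, plus the scores."""
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    scores = d @ q
    order = np.argsort(-scores)
    return order, scores

# Stand-in vectors; real embeddings from all-MiniLM-L6-v2 would have 384 dimensions
doc_vecs = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.7, 0.7, 0.0],
])
query_vec = np.array([0.9, 0.1, 0.0])

order, scores = rank_by_cosine(query_vec, doc_vecs)
print(order)  # [0 2 1]
```

For larger collections the same dot product is usually delegated to a vector index, but the scoring logic stays the same.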

6. Transformer Token Encoding (BERT)

For token-level tasks (NER, POS, span extraction), you need token-level representations:

from transformers import AutoTokenizer, AutoModel
import torch

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

text = "The new AI model achieved state-of-the-art results."
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)

with torch.no_grad():  # no gradients needed for feature extraction
    outputs = model(**inputs)

# Shape: (batch, sequence_length, hidden_dim)
token_embeddings = outputs.last_hidden_state
cls_embedding = token_embeddings[:, 0, :]  # [CLS] as sentence representation
# sequence_length counts [CLS], [SEP], and every word piece, so it exceeds the word count
print(f"Token embeddings shape: {token_embeddings.shape}")  # (1, sequence_length, 768)
print(f"CLS embedding shape: {cls_embedding.shape}")  # (1, 768)
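A common alternative to taking the [CLS] vector is mean pooling: averaging the token vectors while ignoring padded positions. The sketch below shows the arithmetic in NumPy on stand-in arrays; `mean_pool` is a name introduced here, and with real model output the inputs would be `outputs.last_hidden_state` and `inputs["attention_mask"]` converted to arrays.

```python
import numpy as np

def mean_pool(token_embeddings, attention_mask):
    """Average token vectors per sequence, ignoring positions where the mask is 0."""
    mask = attention_mask[..., None].astype(token_embeddings.dtype)  # (batch, seq, 1)
    summed = (token_embeddings * mask).sum(axis=1)                   # (batch, dim)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)                   # avoid division by zero
    return summed / counts

# Stand-in batch: one sequence, three positions, the last one is padding
token_embeddings = np.array([[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]])
attention_mask = np.array([[1, 1, 0]])

pooled = mean_pool(token_embeddings, attention_mask)
print(pooled)  # [[2. 3.]]
```

The padded position does not affect the result, which matters once sequences of different lengths are batched together.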

Choosing the Right Method

Method          Memory           Speed    Semantic Quality   Task Fit
One-hot         Huge (sparse)    Fast     None               Toy examples
BoW             Large (sparse)   Fast     None               Topic modeling, baseline
TF-IDF          Large (sparse)   Fast     Low                Search, keyword matching
Avg. Word2Vec   Small (dense)    Fast     Medium             Text classification
Sentence-BERT   Small (dense)    Medium   High               Semantic search, RAG
BERT tokens     Small (dense)    Slow     Highest            NER, QA, fine-tuning

Full Vectorization Pipeline

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.linear_model import LogisticRegression

# TF-IDF + dimensionality reduction + classifier
pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(stop_words='english', ngram_range=(1, 2), max_features=10000)),
    ('svd', TruncatedSVD(n_components=100)),  # LSA: latent semantic analysis
    ('clf', LogisticRegression(max_iter=300))
])

# Fit and evaluate
# pipeline.fit(X_train, y_train)
# pipeline.score(X_test, y_test)
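To see the same pipeline shape run end to end, the sketch below fits it on a handful of invented sentences. The texts, labels, and the name `toy_pipeline` are hypothetical, and `n_components` is reduced to 2 because the SVD cannot keep more components than the toy vocabulary has features.

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data: label 0 = NLP, label 1 = finance
train_texts = [
    "the model learns word embeddings from text",
    "transformers encode sentences into vectors",
    "tokenization splits text into word pieces",
    "the stock market fell after the announcement",
    "investors sold shares as interest rates rose",
    "the central bank raised interest rates again",
]
train_labels = [0, 0, 0, 1, 1, 1]

toy_pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(stop_words="english")),
    ("svd", TruncatedSVD(n_components=2, random_state=0)),
    ("clf", LogisticRegression(max_iter=300)),
])
toy_pipeline.fit(train_texts, train_labels)

# The pipeline accepts raw strings; vectorization happens inside
predictions = toy_pipeline.predict(["word embeddings for text", "interest rates and shares"])
print(predictions.shape)  # (2,)
```

With so little data the predicted labels are not meaningful; the point is that raw text goes in and class labels come out, with every vectorization step fitted only on the training split.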

For new projects in 2025, start with a TF-IDF baseline, then try sentence-transformer embeddings. Only add the complexity of fine-tuned transformers if the baseline doesn’t meet your accuracy target.