Tokenization in NLP

Before any NLP model can process text, the text has to be broken into discrete units the model can work with. That process is tokenization, and it is more nuanced than it looks.

What Tokenization Does

Tokenization converts a raw string into a sequence of tokens. A token is any meaningful unit: a word, a punctuation mark, a subword fragment, or a whole sentence, depending on the granularity you choose.

Input:  "The model couldn't understand."
Tokens: ["The", "model", "could", "n't", "understand", "."]

The sentence wasn’t just split on spaces: the contraction was split into "could" and "n't", and the final period became its own token. Every tokenizer makes design decisions about where to cut.
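
To see why splitting on spaces alone falls short, here is a minimal sketch that compares str.split with a small regular-expression tokenizer. The pattern is an illustrative assumption written for this one sentence, not the rule set of any particular library.

```python
import re

text = "The model couldn't understand."

# Naive approach: split on whitespace only.
whitespace_tokens = text.split()

# Illustrative pattern (an assumption, not a library's rules):
#   n't            the contraction suffix as its own token
#   \w+?(?=n't)    a word stem that stops right before "n't"
#   \w+            any other run of word characters
#   [^\w\s]        any single punctuation mark or symbol
PATTERN = re.compile(r"n't|\w+?(?=n't)|\w+|[^\w\s]")
regex_tokens = PATTERN.findall(text)

print(whitespace_tokens)  # ['The', 'model', "couldn't", 'understand.']
print(regex_tokens)       # ['The', 'model', 'could', "n't", 'understand', '.']
```

The whitespace version glues the period to "understand" and leaves the contraction intact, so downstream code would see "understand." and "understand" as different tokens.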

Why It Matters

Every downstream task (sentiment analysis, named entity recognition, machine translation, text classification) sees only the token sequence, not the original string. A bad tokenizer puts boundaries in the wrong places, loses meaning, or produces too many or too few units. Garbage in, garbage out, at the very first step.

Word Tokenization

The simplest form: split on whitespace and punctuation.

NLTK word_tokenize:

import nltk
nltk.download('punkt_tab')
from nltk.tokenize import word_tokenize

text = "Climate data from 2024 shows a 1.4°C rise, faster than expected."
tokens = word_tokenize(text)
print(tokens)
# ['Climate', 'data', 'from', '2024', 'shows', 'a', '1.4°C', 'rise',
#  ',', 'faster', 'than', 'expected', '.']
# Note: '1.4°C' stays one token; word_tokenize has no rule that splits on '°'.

spaCy tokenizer:

import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp("New York-based firms reported record Q1 earnings in 2025.")
print([token.text for token in doc])
# ['New', 'York', '-', 'based', 'firms', 'reported', 'record', 'Q1', 'earnings', 'in', '2025', '.']

Sentence Tokenization

Split a paragraph into individual sentences before word-level processing:

from nltk.tokenize import sent_tokenize

paragraph = "NLP powers search engines and voice assistants. It also enables real-time translation. The field has grown rapidly since 2020."
sentences = sent_tokenize(paragraph)
for s in sentences:
    print(s)
# NLP powers search engines and voice assistants.
# It also enables real-time translation.
# The field has grown rapidly since 2020.
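
Sentence splitting looks trivial until abbreviations appear. The sketch below uses a hypothetical helper, naive_sent_split, that splits after every ".", "!" or "?" followed by whitespace. It is shown only to illustrate the failure mode that trained splitters such as the Punkt model behind sent_tokenize are designed to avoid.

```python
import re

def naive_sent_split(text):
    """Split after '.', '!' or '?' when followed by whitespace (hypothetical helper)."""
    return re.split(r"(?<=[.!?])\s+", text.strip())

simple = "NLP powers search engines. It also enables translation."
tricky = "Dr. Lee joined in 2020. She leads the NLP team."

print(naive_sent_split(simple))
# ['NLP powers search engines.', 'It also enables translation.']
print(naive_sent_split(tricky))
# ['Dr.', 'Lee joined in 2020.', 'She leads the NLP team.']
```

The second result has three pieces for two sentences because the rule cannot tell the period in "Dr." from a sentence boundary.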

Subword Tokenization

Transformer models such as BERT, GPT, and LLaMA don’t tokenize at the word level. They use subword methods that handle rare words and multiple languages without an exploding vocabulary.

Byte-Pair Encoding (BPE)

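Start with individual characters, then repeatedly merge the most frequent adjacent pair of symbols until the vocabulary reaches a target size. Rare words end up split into known subword fragments. The splits below are illustrative; the actual output depends on the merges learned from the training corpus.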

"tokenization" → ["token", "ization"]
"unhappiness"  → ["un", "happiness"]
"COVID-19"     → ["C", "OV", "ID", "-", "19"]
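
The merge loop can be sketched in a few lines. This is a toy version under simplifying assumptions: words are split on whitespace, there is no end-of-word marker, ties go to the pair seen first, and the function names (get_pair_counts, merge_pair, train_bpe) are made up for illustration rather than taken from any library.

```python
from collections import Counter

def get_pair_counts(words):
    """Count adjacent symbol pairs, weighted by how often each word occurs."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(words, pair):
    """Rewrite every word so that each occurrence of `pair` becomes one symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

def train_bpe(corpus, num_merges):
    """Learn up to `num_merges` merges from a whitespace-separated corpus."""
    # Each word starts as a tuple of single characters.
    words = {tuple(w): f for w, f in Counter(corpus.split()).items()}
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(words)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent pair; first seen wins ties
        merges.append(best)
        words = merge_pair(words, best)
    return merges, words

merges, words = train_bpe("low low low lower lowest", num_merges=2)
print(merges)  # [('l', 'o'), ('lo', 'w')]
print(words)   # {('low',): 3, ('low', 'e', 'r'): 1, ('low', 'e', 's', 't'): 1}
```

After two merges the frequent word "low" is a single symbol, while the rarer "lower" and "lowest" are still "low" plus leftover characters.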

WordPiece (BERT)

Similar to BPE, but it chooses each merge by how much it increases the likelihood of the training data rather than by raw pair frequency. A word that can’t be decomposed into vocabulary pieces becomes [UNK].
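
At inference time, BERT-style WordPiece tokenizes each word by greedy longest-match-first lookup against its vocabulary. The sketch below assumes a tiny hand-written vocabulary and a hypothetical function name, wordpiece_tokenize; a real vocabulary has tens of thousands of entries.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word into vocabulary pieces."""
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # mark non-initial pieces
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece fits, so the whole word is unknown
        tokens.append(piece)
        start = end
    return tokens

# Toy vocabulary, made up for this example.
vocab = {"token", "##ization", "##s", "sub", "##word"}

print(wordpiece_tokenize("tokenization", vocab))  # ['token', '##ization']
print(wordpiece_tokenize("subwords", vocab))      # ['sub', '##word', '##s']
print(wordpiece_tokenize("xyz", vocab))           # ['[UNK]']
```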

SentencePiece

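Language-agnostic subword tokenizer that operates on raw text as a sequence of Unicode characters and treats whitespace as an ordinary symbol (shown as ▁), so it needs no language-specific pre-tokenization. That makes it suitable for languages written without spaces between words, such as Chinese and Japanese, as well as space-delimited languages such as Arabic.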
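
SentencePiece marks spaces with the visible symbol ▁ (U+2581), which is what makes its output reversible. The sketch below imitates only that idea with plain string operations and single-character pieces; it does not call the sentencepiece library, and the helper names to_symbols and from_symbols are hypothetical.

```python
SPACE = "\u2581"  # "▁", the marker SentencePiece uses for whitespace

def to_symbols(text):
    """Turn text into symbols, keeping spaces as explicit ▁ markers."""
    # A leading marker is added so every word starts the same way.
    return list(SPACE + text.replace(" ", SPACE))

def from_symbols(symbols):
    """Invert to_symbols: join, restore spaces, drop the leading marker."""
    return "".join(symbols).replace(SPACE, " ")[1:]

for text in ["Hello world", "こんにちは世界"]:
    symbols = to_symbols(text)
    print(symbols)
    print(from_symbols(symbols) == text)  # True: nothing was lost
```

Because the space is just another symbol, the same procedure works whether or not the script separates words with spaces.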

Token Counts and LLM Context Windows

When working with LLMs in 2025, tokenization has direct cost implications: providers charge per token, and a model’s context window is measured in tokens too. With GPT-4’s tokenizer, English text averages roughly 1.3 tokens per word.

# Using tiktoken (OpenAI's tokenizer)
import tiktoken
enc = tiktoken.encoding_for_model("gpt-4o")
text = "Explain the impact of transformer models on NLP research."
tokens = enc.encode(text)
print(f"Token count: {len(tokens)}")   # exact count depends on the model's encoding
print(tokens)                           # a list of integer token IDs
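
For quick budgeting before calling a real tokenizer, the 1.3 tokens-per-word rule of thumb can be turned into a rough estimate. Everything here is a sketch: the price and context-window size are hypothetical placeholders, not any provider's actual figures, and the helper names are made up.

```python
TOKENS_PER_WORD = 1.3        # rough English average quoted above
PRICE_PER_1K_TOKENS = 0.01   # hypothetical placeholder, not a real rate
CONTEXT_WINDOW = 8192        # hypothetical placeholder size, in tokens

def estimate_tokens(text):
    """Estimate the token count from the whitespace word count."""
    return round(len(text.split()) * TOKENS_PER_WORD)

def estimate_cost(n_tokens, price_per_1k=PRICE_PER_1K_TOKENS):
    """Cost of n_tokens at a per-1,000-token price."""
    return n_tokens / 1000 * price_per_1k

def fits_context(n_tokens, context_window=CONTEXT_WINDOW, reserved_for_output=0):
    """Check that the prompt plus reserved output tokens fit in the window."""
    return n_tokens + reserved_for_output <= context_window

document = "word " * 1000  # stand-in for a 1,000-word document
n = estimate_tokens(document)
print(n)                                       # 1300
print(f"{estimate_cost(n):.4f}")               # 0.0130
print(fits_context(n, reserved_for_output=512))  # True
```

An estimate like this is only a first pass; for billing or hard limits, count tokens with the model's own tokenizer.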

Tokenization Challenges

Hyphenated words: Is state-of-the-art one token or four? Depends on the tokenizer.

URLs and emails: user@domain.com shouldn’t be split into user, @, domain, ., com.

Multilingual text: Mixed-script documents need tokenizers that handle multiple writing systems.

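Emojis and special characters: in "🚀 launched", the emoji carries meaning and shouldn’t be dropped. Byte-level subword models can always encode it, though sometimes as several byte-level pieces rather than one token.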

Numbers with units: 3.5km — keep together or split?
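
One common way to handle several of these cases in a rule-based pipeline is to try specific patterns before general ones. The pattern below is a minimal sketch with assumed rules (keep URLs, emails, number-plus-unit strings, and hyphenated compounds whole); it is not exhaustive, and a URL at the end of a sentence would swallow the trailing period.

```python
import re

# Alternatives are tried left to right, so specific patterns come first.
TOKEN_RE = re.compile(
    r"""
    https?://\S+                     # URLs
  | [\w.+-]+@[\w-]+(?:\.[\w-]+)+     # email addresses
  | \d+(?:\.\d+)?[a-zA-Z]+           # number with attached unit, e.g. 3.5km
  | \w+(?:-\w+)*                     # words, including hyphenated compounds
  | [^\w\s]                          # any other single symbol (punctuation, emoji)
    """,
    re.VERBOSE,
)

text = ("Email user@domain.com or see https://example.com/docs "
        "for the 3.5km state-of-the-art route.")
tokens = TOKEN_RE.findall(text)
print(tokens)
# ['Email', 'user@domain.com', 'or', 'see', 'https://example.com/docs',
#  'for', 'the', '3.5km', 'state-of-the-art', 'route', '.']
```

Whether "3.5km" or "state-of-the-art" should stay whole is a design choice; reordering or removing an alternative changes the answer.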

Tokenization in the LLM Era (2025)

Traditional word tokenization is mostly used for rule-based NLP pipelines. For any work with transformer models, you’ll be using the model’s own tokenizer, which must be used consistently between training and inference:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
result = tokenizer("Large language models process subword tokens efficiently.")
print(result['input_ids'])
# [101, 2312, 2653, 4275, 2832, 4942, 17899, 19204, 2015, 14954, 1012, 102]
print(tokenizer.convert_ids_to_tokens(result['input_ids']))
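# ['[CLS]', 'large', 'language', 'models', 'process', 'sub', '##word', 'token', '##s', 'efficiently', '.', '[SEP]']

Notice that subword becomes ['sub', '##word'] and tokens becomes ['token', '##s']: the ## prefix indicates a continuation fragment, not a word start. The list has twelve tokens, one per input ID, including the [CLS] and [SEP] markers the tokenizer adds.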
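
When you need word-level units back, for example to align model output with the original text, the ## convention makes the pieces easy to rejoin. This is a minimal sketch with a hypothetical helper, merge_wordpieces, and a hand-written token list; it is not part of the transformers API.

```python
def merge_wordpieces(tokens, special=("[CLS]", "[SEP]", "[PAD]")):
    """Rejoin WordPiece fragments into whole words, skipping special tokens."""
    words = []
    for tok in tokens:
        if tok in special:
            continue
        if tok.startswith("##") and words:
            words[-1] += tok[2:]  # continuation: append to the previous word
        else:
            words.append(tok)     # word start (or standalone punctuation)
    return words

# Hand-written example pieces, not real tokenizer output.
pieces = ["[CLS]", "sub", "##word", "token", "##ization", ".", "[SEP]"]
print(merge_wordpieces(pieces))  # ['subword', 'tokenization', '.']
```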