Tokenization in NLP
Before any NLP model can process text, it has to break language down into discrete units it can work with. That breaking-down process is tokenization — and it’s more nuanced than it looks.
What Tokenization Does
Tokenization converts a raw string into a sequence of tokens. A token is whatever unit the tokenizer is designed to emit: a word, a punctuation mark, a subword fragment, or even a whole sentence, depending on the level of granularity.
Input: "The model couldn't understand."
Tokens: ["The", "model", "could", "n't", "understand", "."]
The sentence wasn’t just split on spaces. The contraction was split in two and the final period was separated from its word. Every tokenizer makes design decisions about where to cut.
Why It Matters
Every downstream task — sentiment analysis, named entity recognition, machine translation, text classification — sees only the token sequence, not the original string. A bad tokenizer produces tokens that carry wrong boundaries, lose meaning, or produce too many/too few units. Garbage in, garbage out at the very first step.
Word Tokenization
The simplest form: split on whitespace and punctuation.
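To make the "whitespace and punctuation" rule concrete, here is a minimal sketch using only Python's standard library. The function name simple_word_tokenize is made up for illustration; it is not how NLTK or spaCy work internally.

```python
import re

def simple_word_tokenize(text: str) -> list[str]:
    """Split text into runs of word characters and single punctuation marks."""
    # \w+ matches a run of letters, digits, or underscores;
    # [^\w\s] matches any single character that is neither word nor whitespace.
    return re.findall(r"\w+|[^\w\s]", text)

text = "The model couldn't understand."

# Plain whitespace splitting leaves punctuation glued to the words.
print(text.split())
# ['The', 'model', "couldn't", 'understand.']

# The regex version separates punctuation, but cuts the contraction naively.
print(simple_word_tokenize(text))
# ['The', 'model', 'couldn', "'", 't', 'understand', '.']
```

The naive handling of the contraction is one reason library tokenizers add language-specific rules on top of a pattern like this.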
NLTK word_tokenize:
import nltk
nltk.download('punkt_tab')  # tokenizer data required by word_tokenize
from nltk.tokenize import word_tokenize

text = "Climate data from 2024 shows a 1.4°C rise, faster than expected."
tokens = word_tokenize(text)
print(tokens)
# ['Climate', 'data', 'from', '2024', 'shows', 'a', '1.4°C', 'rise',
#  ',', 'faster', 'than', 'expected', '.']
spaCy tokenizer:
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp("New York-based firms reported record Q1 earnings in 2025.")
print([token.text for token in doc])
# ['New', 'York', '-', 'based', 'firms', 'reported', 'record', 'Q1', 'earnings', 'in', '2025', '.']
Sentence Tokenization
Split a paragraph into individual sentences before word-level processing:
from nltk.tokenize import sent_tokenize
paragraph = "NLP powers search engines and voice assistants. It also enables real-time translation. The field has grown rapidly since 2020."
sentences = sent_tokenize(paragraph)
for s in sentences:
    print(s)
# NLP powers search engines and voice assistants.
# It also enables real-time translation.
# The field has grown rapidly since 2020.
Subword Tokenization
Modern LLMs (BERT, GPT, LLaMA) don’t tokenize at the word level. They use subword methods that handle rare words and multiple languages without an exploding vocabulary.
Byte-Pair Encoding (BPE)
Start with individual characters, then repeatedly merge the most frequent adjacent pair of symbols until the vocabulary reaches a target size. Rare words get split into known subword fragments.
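The merge loop can be sketched in a few lines of plain Python. This is a toy version under stated assumptions: the function name learn_bpe and the word counts are hypothetical, ties between equally frequent pairs are broken by first occurrence, and production tokenizers add details (end-of-word markers, byte-level handling) that are left out here.

```python
from collections import Counter

def learn_bpe(word_freqs: dict[str, int], num_merges: int) -> list[tuple[str, str]]:
    """Learn up to num_merges BPE merge rules from a word-frequency table."""
    # Represent each word as a tuple of symbols, starting from single characters.
    corpus = {tuple(word): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        # Pick the most frequent pair (first one seen wins a tie).
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Replace each occurrence of the best pair with one merged symbol.
        new_corpus = {}
        for symbols, freq in corpus.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            key = tuple(merged)
            new_corpus[key] = new_corpus.get(key, 0) + freq
        corpus = new_corpus
    return merges

toy_counts = {"low": 5, "lower": 2, "newest": 6, "widest": 3}  # hypothetical corpus
print(learn_bpe(toy_counts, num_merges=3))
# [('e', 's'), ('es', 't'), ('l', 'o')]
```

After three merges the fragment "est" has become a single symbol, because it is shared by the two most frequent words that contain it.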
"tokenization" → ["token", "ization"]"unhappiness" → ["un", "happiness"]"COVID-19" → ["C", "OV", "ID", "-", "19"]WordPiece (BERT)
Similar to BPE, but it chooses merges by a likelihood criterion rather than raw pair frequency. At tokenization time, a word that can’t be decomposed into pieces from the vocabulary becomes [UNK].
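The tokenization-time behaviour, including the [UNK] fallback, can be illustrated with a greedy longest-match-first lookup. The sketch below assumes a tiny made-up vocabulary and a hypothetical function name; it covers only how an already trained vocabulary is applied to one word, not how the vocabulary is learned.

```python
def wordpiece_tokenize(word: str, vocab: set[str], unk: str = "[UNK]") -> list[str]:
    """Greedy longest-match-first split of one word into WordPiece-style pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink from the right.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry the ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits here, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces

toy_vocab = {"sub", "##word", "token", "##s", "##ization"}  # hypothetical vocabulary
print(wordpiece_tokenize("subword", toy_vocab))       # ['sub', '##word']
print(wordpiece_tokenize("tokenization", toy_vocab))  # ['token', '##ization']
print(wordpiece_tokenize("xyz", toy_vocab))           # ['[UNK]']
```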
SentencePiece
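Language-agnostic subword tokenizer that treats the input as a raw sequence of Unicode characters, with whitespace handled as an ordinary symbol (written as ▁). Because it needs no language-specific pre-tokenization, it works for Chinese, Japanese, and other scripts without spaces between words, as well as for space-delimited languages such as Arabic.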
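The whitespace-as-symbol idea is easy to see without the library itself. The following sketch does not call SentencePiece; the helper names are hypothetical, and it only shows the reversible text representation that a subword model (BPE or unigram) would then be trained on.

```python
WS = "\u2581"  # the "▁" meta symbol used to mark a space

def to_symbols(text: str) -> list[str]:
    """Turn raw text into a character sequence with spaces made explicit."""
    # A marker is prepended so the first word looks like every other word.
    return list(WS + text.replace(" ", WS))

def from_pieces(pieces: list[str]) -> str:
    """Invert the mapping: join pieces, restore spaces, drop the leading marker."""
    return "".join(pieces).replace(WS, " ")[1:]

print(to_symbols("Hello world"))
# ['▁', 'H', 'e', 'l', 'l', 'o', '▁', 'w', 'o', 'r', 'l', 'd']

# Any segmentation of the symbol sequence can be decoded back losslessly.
print(from_pieces(["\u2581Hello", "\u2581wor", "ld"]))
# Hello world
```

Because spaces survive as symbols, decoding is a plain string join, and text with no spaces at all goes through the same path.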
Token Counts and LLM Context Windows
When working with LLMs in 2025, tokenization has direct cost implications: providers charge per token, and context-window limits are counted in tokens as well. As a rule of thumb, an English word averages roughly 1.3 tokens with GPT-4’s tokenizer.
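For quick budgeting, the 1.3 tokens-per-word rule of thumb can be turned into a back-of-the-envelope estimate. This is a rough sketch: the function names are made up for illustration, and the price of 2.50 USD per million tokens is a placeholder, not any provider's actual rate. An exact count requires the model's real tokenizer.

```python
def estimate_tokens(text: str, tokens_per_word: float = 1.3) -> int:
    """Rough token estimate from a whitespace word count."""
    return round(len(text.split()) * tokens_per_word)

def estimate_cost(n_tokens: int, usd_per_million_tokens: float) -> float:
    """Cost in USD at a given per-million-token price."""
    return n_tokens / 1_000_000 * usd_per_million_tokens

prompt = "Summarize the attached report in three short bullet points."
n = estimate_tokens(prompt)
print(n)  # 9 words * 1.3 = 11.7, rounded to 12

# Sending this prompt 100,000 times at a placeholder price of 2.50 USD per million tokens.
print(f"{estimate_cost(n * 100_000, 2.50):.2f}")  # 3.00
```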
# Using tiktoken (OpenAI's tokenizer library)
import tiktoken
enc = tiktoken.encoding_for_model("gpt-4o")
text = "Explain the impact of transformer models on NLP research."
tokens = enc.encode(text)
print(f"Token count: {len(tokens)}")  # → 10
print(tokens)  # → a list of integer token IDs (values depend on the encoding)
Tokenization Challenges
Hyphenated words: Is state-of-the-art one token or four? Depends on the tokenizer.
URLs and emails: user@domain.com shouldn’t be split into user, @, domain, ., com.
Multilingual text: Mixed-script documents need tokenizers that handle multiple writing systems.
Emojis and special characters: in "🚀 launched", the emoji is valid input for subword models, though a byte-level tokenizer may encode it as several tokens.
Numbers with units: 3.5km — keep together or split?
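Several of these cases can be handled in a rule-based tokenizer by listing the "keep together" patterns before the generic ones. The sketch below is one possible set of choices, not a standard: the patterns are deliberately simplified, the name tokenize is hypothetical, and a URL directly followed by punctuation would need extra handling.

```python
import re

# Alternatives are tried left to right, so the specific patterns come first.
TOKEN_RE = re.compile(
    r"""
    https?://\S+                       # URLs (simplified)
    | [\w.+-]+@[\w-]+(?:\.[\w-]+)+     # email addresses (simplified)
    | \d+(?:\.\d+)?[a-zA-Z]*           # numbers, optionally with a unit: 3.5km
    | \w+(?:-\w+)*                     # words, including hyphenated compounds
    | [^\w\s]                          # any other single symbol (punctuation, emoji)
    """,
    re.VERBOSE,
)

def tokenize(text: str) -> list[str]:
    """Return all non-overlapping matches, scanning left to right."""
    return TOKEN_RE.findall(text)

print(tokenize("Email user@domain.com about the state-of-the-art 3.5km run 🚀"))
# ['Email', 'user@domain.com', 'about', 'the', 'state-of-the-art', '3.5km', 'run', '🚀']
```

Here state-of-the-art and 3.5km are kept whole by design. Flipping either decision means changing one pattern, which is exactly the kind of choice every tokenizer has to make.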
Tokenization in the LLM Era (2025)
Traditional word tokenization is mostly used for rule-based NLP pipelines. For any work with transformer models, you’ll be using the model’s own tokenizer, which must be used consistently between training and inference:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
result = tokenizer("Large language models process subword tokens efficiently.")
print(result['input_ids'])
# [101, 2312, 2653, 4275, 2832, 4942, 17899, 19204, 2015, 14954, 1012, 102]
print(tokenizer.convert_ids_to_tokens(result['input_ids']))
# ['[CLS]', 'large', 'language', 'models', 'process', 'sub', '##word', 'token', '##s', 'efficiently', '.', '[SEP]']
Notice that subword becomes ['sub', '##word'] and tokens becomes ['token', '##s']: the ## prefix marks a continuation fragment, not a word start.
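When token-level output has to be mapped back to words, for example to display predictions, the ## convention makes this a simple merge. A minimal sketch, assuming BERT-style pieces as input; the helper name merge_wordpieces is hypothetical and not part of the transformers API.

```python
def merge_wordpieces(tokens: list[str]) -> list[str]:
    """Rejoin ##-prefixed continuation pieces with the piece before them."""
    words = []
    for tok in tokens:
        if tok.startswith("##") and words:
            words[-1] += tok[2:]  # glue the fragment onto the current word
        else:
            words.append(tok)
    return words

pieces = ["[CLS]", "sub", "##word", "token", "##s", "[SEP]"]
print(merge_wordpieces(pieces))
# ['[CLS]', 'subword', 'tokens', '[SEP]']
```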