NLTK — Natural Language Toolkit
NLTK (Natural Language Toolkit) is Python’s foundational NLP library. It provides essential tools for text processing: tokenization, stemming, POS tagging, and parsing, plus access to over 50 corpora. While newer libraries like spaCy and Hugging Face Transformers are better suited to production workloads (spaCy for speed, Hugging Face models for accuracy), NLTK remains the best starting point for learning NLP concepts.
Installation and Setup
```bash
pip install nltk
```

Download the corpora and models you need:

```python
import nltk

# Download everything (large, ~3GB total)
# nltk.download('all')

# Or download only what you need
nltk.download('punkt_tab')                       # Tokenizer
nltk.download('averaged_perceptron_tagger_eng')  # POS tagger
nltk.download('wordnet')                         # Lemmatizer
nltk.download('stopwords')                       # Stopword lists
nltk.download('vader_lexicon')                   # Sentiment analyzer
nltk.download('maxent_ne_chunker_tab')           # NER chunker
nltk.download('words')                           # English word corpus
```

Tokenization
```python
from nltk.tokenize import word_tokenize, sent_tokenize, TweetTokenizer

text = "NLP has evolved dramatically in 2025! Models like Claude and GPT-4 demonstrate remarkable language understanding."

# Word tokens
words = word_tokenize(text)
print("Words:", words[:8])
# ['NLP', 'has', 'evolved', 'dramatically', 'in', '2025', '!', 'Models']

# Sentence tokens
sentences = sent_tokenize(text)
print("Sentences:", sentences)

# Tweet tokenizer (handles @mentions, #hashtags, emoji)
tweet_tokenizer = TweetTokenizer(strip_handles=True, reduce_len=True)
tweet = "@user This new LLM is sooooo impressive!!! #NLP #AI"
print("Tweet tokens:", tweet_tokenizer.tokenize(tweet))
# ['This', 'new', 'LLM', 'is', 'sooo', 'impressive', '!', '!', '!', '#NLP', '#AI']
```

Stemming
```python
from nltk.stem import PorterStemmer, SnowballStemmer, LancasterStemmer

words = ["running", "studies", "happily", "generously", "transformers"]

porter = PorterStemmer()
snowball = SnowballStemmer("english")
lancaster = LancasterStemmer()

# Print the three stemmers' outputs side by side
print(f"{'Word':<15} {'Porter':<12} {'Snowball':<12} {'Lancaster'}")
print("-" * 55)
for word in words:
    print(f"{word:<15} {porter.stem(word):<12} {snowball.stem(word):<12} {lancaster.stem(word)}")
```

Lemmatization
```python
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet

lem = WordNetLemmatizer()

# Lemmatize with POS for best accuracy
def lemmatize_with_pos(word, pos_tag):
    # Map the first letter of a Penn Treebank tag to a WordNet POS constant
    pos_map = {
        'J': wordnet.ADJ,
        'V': wordnet.VERB,
        'N': wordnet.NOUN,
        'R': wordnet.ADV
    }
    wn_pos = pos_map.get(pos_tag[0], wordnet.NOUN)  # default to noun
    return lem.lemmatize(word, pos=wn_pos)

pairs = [("running", "VBG"), ("studies", "NNS"), ("better", "JJR"), ("quickly", "RB")]
for word, tag in pairs:
    print(f"{word:<12} → {lemmatize_with_pos(word, tag)}")
# running      → run
# studies      → study
# better       → good
# quickly      → quickly
```

POS Tagging
```python
from nltk import pos_tag, word_tokenize

text = "The neural network efficiently processes sequential text data."
tokens = word_tokenize(text)
tagged = pos_tag(tokens)

print(tagged)
# [('The', 'DT'), ('neural', 'JJ'), ('network', 'NN'), ('efficiently', 'RB'), ...]

# Extract specific POS
nouns = [(w, t) for w, t in tagged if t.startswith('NN')]
verbs = [(w, t) for w, t in tagged if t.startswith('VB')]
print("Nouns:", nouns)
print("Verbs:", verbs)
```

Named Entity Recognition
```python
from nltk import ne_chunk, pos_tag, word_tokenize
from nltk.tree import Tree

text = "Anthropic was founded in San Francisco by Dario Amodei in 2021."
tokens = word_tokenize(text)
tagged = pos_tag(tokens)
entities = ne_chunk(tagged)

# Extract named entities: each entity is a labeled subtree of the chunk tree
named_entities = []
for chunk in entities:
    if isinstance(chunk, Tree):
        entity = " ".join(w for w, t in chunk)
        named_entities.append((entity, chunk.label()))

print(named_entities)
# Typical output (exact labels can vary with the NLTK version and model):
# [('Anthropic', 'ORGANIZATION'), ('San Francisco', 'GPE'), ('Dario Amodei', 'PERSON')]
```

Sentiment Analysis (VADER)
```python
from nltk.sentiment import SentimentIntensityAnalyzer

sia = SentimentIntensityAnalyzer()

reviews = [
    "This library is absolutely fantastic for learning NLP concepts!",
    "The documentation is outdated and the API is confusing.",
    "It works, but there are faster alternatives available."
]

for review in reviews:
    scores = sia.polarity_scores(review)
    # Classify by compound score: above 0.05 positive, below -0.05 negative
    if scores['compound'] > 0.05:
        sentiment = "Positive"
    elif scores['compound'] < -0.05:
        sentiment = "Negative"
    else:
        sentiment = "Neutral"
    print(f"{sentiment} ({scores['compound']:.3f}): {review[:50]}")
```

Frequency Distributions
```python
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk import FreqDist

text = """Transformer models have revolutionized natural language processing.
Attention mechanisms allow models to capture long-range dependencies.
BERT, GPT, and Claude are built on transformer architecture."""

# Keep lowercase alphabetic tokens that are not stopwords
stop_words = set(stopwords.words('english'))
tokens = [w.lower() for w in word_tokenize(text)
          if w.isalpha() and w.lower() not in stop_words]

fdist = FreqDist(tokens)
print("Most common words:", fdist.most_common(10))
fdist.plot(10)  # Frequency chart (requires matplotlib)
```

When to Use NLTK vs Other Libraries
| Task | NLTK | spaCy | Hugging Face |
|---|---|---|---|
| Learning NLP | Best | Good | Overkill for basics |
| Production pipeline | Slow | Fast | Best for accuracy |
| Corpus access | 50+ corpora | Limited | Large Hub |
| Custom grammars | Yes (CFG) | No | No |
| Multilingual | Limited | Good (20+ langs) | Excellent |
| Sentiment (VADER) | Built-in | No | Requires model |
NLTK shines for education, research prototypes, and tasks that need grammar-based analysis. For production NLP in 2025, spaCy handles most preprocessing needs faster, while Hugging Face models provide the highest accuracy.
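The grammar-based analysis mentioned above can be sketched with `nltk.CFG` and `nltk.ChartParser`, neither of which needs any downloaded data. The toy grammar and the sentence below are assumptions made up for illustration; a real grammar would need far broader coverage.

```python
import nltk

# Toy context-free grammar (illustrative only, not from any NLTK corpus)
grammar = nltk.CFG.fromstring("""
S -> NP VP
NP -> Det N | Det Adj N
VP -> V NP
Det -> 'the'
Adj -> 'neural'
N -> 'network' | 'text'
V -> 'processes'
""")

parser = nltk.ChartParser(grammar)

# The grammar only covers these lowercase words, so tokenize with a plain split
tokens = "the neural network processes the text".split()

# parse() yields every tree the grammar licenses for the token sequence
trees = list(parser.parse(tokens))
for tree in trees:
    print(tree)
```

Note that `parse()` raises a `ValueError` if the input contains a word the grammar does not cover, so this approach suits controlled vocabularies and teaching examples better than open-domain text.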