Part-of-Speech Tagging in NLP
Part-of-speech (POS) tagging assigns a grammatical label — noun, verb, adjective, adverb, and so on — to each word in a sentence. It’s one of the foundational steps in understanding the structure of language.
Why POS Tagging Matters
The same word can mean different things depending on its grammatical role:
- “The bank by the river” → bank is a NOUN (landform)
- “I need to bank that check” → bank is a VERB (action)
- “A fast car” → fast is an ADJECTIVE
- “She ran fast” → fast is an ADVERB
Without POS tags, a system has no explicit signal for telling these uses apart. Downstream tasks such as lemmatization, named entity recognition, syntactic parsing, and information extraction all benefit from accurate POS labels.
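To make the ambiguity concrete, here is a minimal sketch of a tagger that looks each word up in a fixed dictionary. The lexicon and the single contextual rule are invented for illustration and are far cruder than what NLTK or spaCy do; the point is only that a context-free lookup must give "bank" the same tag in both sentences.

```python
# Toy lexicon (an assumption for illustration): one fixed tag per word.
LEXICON = {
    "the": "DET", "i": "PRON", "need": "VERB", "to": "PART",
    "bank": "NOUN", "by": "ADP", "river": "NOUN",
    "that": "DET", "check": "NOUN",
}


def lookup_tag(tokens):
    """Context-free tagging: every word always gets its lexicon tag."""
    return [(tok, LEXICON.get(tok.lower(), "X")) for tok in tokens]


def contextual_tag(tokens):
    """Same lookup plus one hand-written rule: a word right after 'to' is a verb."""
    tagged = lookup_tag(tokens)
    for i in range(1, len(tagged)):
        if tokens[i - 1].lower() == "to":
            tagged[i] = (tokens[i], "VERB")
    return tagged


noun_use = "The bank by the river".split()
verb_use = "I need to bank that check".split()

print(lookup_tag(verb_use)[3])      # ('bank', 'NOUN'), wrong for this sentence
print(contextual_tag(verb_use)[3])  # ('bank', 'VERB')
print(contextual_tag(noun_use)[1])  # ('bank', 'NOUN')
```

The "word after to" rule breaks easily (for example in "to the river"), which is why practical taggers learn from context statistically rather than relying on hand-written rules like this one.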
Common Tag Sets
Penn Treebank POS Tags (most common in English NLP):
Tag   Description               Examples
NN    Noun, singular            "dog", "city"
NNS   Noun, plural              "dogs", "cities"
NNP   Proper noun, singular     "London", "Alice"
VB    Verb, base form           "run", "think"
VBD   Verb, past tense          "ran", "thought"
VBG   Verb, gerund/participle   "running", "thinking"
JJ    Adjective                 "fast", "beautiful"
JJR   Adjective, comparative    "faster", "prettier"
RB    Adverb                    "quickly", "very"
DT    Determiner                "the", "a", "an"
IN    Preposition               "in", "on", "of"
CC    Coordinating conjunction  "and", "but", "or"
PRP   Personal pronoun          "I", "he", "they"
POS Tagging with NLTK
    import nltk

    # One-time downloads: the English perceptron tagger and the Punkt tokenizer data
    nltk.download('averaged_perceptron_tagger_eng')
    nltk.download('punkt_tab')

    from nltk.tokenize import word_tokenize
    from nltk import pos_tag

    text = "LLMs have revolutionized how developers build language-aware applications."
    tokens = word_tokenize(text)
    tagged = pos_tag(tokens)

    print(tagged)
    # [('LLMs', 'NNS'), ('have', 'VBP'), ('revolutionized', 'VBN'),
    #  ('how', 'WRB'), ('developers', 'NNS'), ('build', 'VBP'),
    #  ('language-aware', 'JJ'), ('applications', 'NNS'), ('.', '.')]
Extract only the nouns:
    nouns = [word for word, tag in tagged if tag.startswith('NN')]
    print(nouns)  # ['LLMs', 'developers', 'applications']
POS Tagging with spaCy
spaCy provides both the Penn Treebank tag (token.tag_) and a simpler universal tag (token.pos_):
    import spacy

    # Requires the small English model: python -m spacy download en_core_web_sm
    nlp = spacy.load("en_core_web_sm")

    text = "The new model released in 2025 outperforms its predecessors significantly."
    doc = nlp(text)

    print(f"{'Token':<15} {'POS':<8} {'Tag':<8} {'Explanation'}")
    print("-" * 55)
    for token in doc:
        print(f"{token.text:<15} {token.pos_:<8} {token.tag_:<8} {spacy.explain(token.tag_)}")

    # Output (abridged: the rows for "in", "its", and "." are omitted;
    # exact tags and explanation strings can vary by model version):
    # Token           POS      Tag      Explanation
    # -------------------------------------------------------
    # The             DET      DT       determiner
    # new             ADJ      JJ       adjective
    # model           NOUN     NN       noun, singular or mass
    # released        VERB     VBN      verb, past participle
    # 2025            NUM      CD       cardinal number
    # outperforms     VERB     VBZ      verb, 3rd person singular present
    # predecessors    NOUN     NNS      noun, plural
    # significantly   ADV      RB       adverb
Universal POS Tags
When working with multilingual models, universal POS tags offer a consistent set across languages:
NOUN PROPN VERB AUX ADJ ADV PRON DET ADP
NUM CCONJ SCONJ PART INTJ PUNCT SYM X
spaCy’s token.pos_ returns these universal tags.
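The two tag sets are related: each fine-grained Penn Treebank tag corresponds to a coarser universal tag. As a minimal sketch, the hand-written table below covers only the Penn tags listed earlier in this article. The names `PTB_TO_UNIVERSAL` and `to_universal` are hypothetical, and the mapping is a simplification rather than the one any particular library uses.

```python
# Partial, hand-written mapping from Penn Treebank tags to universal tags.
# Simplifying assumption: "IN" is mapped to ADP, although in practice it can
# also mark subordinating conjunctions (SCONJ).
PTB_TO_UNIVERSAL = {
    "NN": "NOUN", "NNS": "NOUN", "NNP": "PROPN",
    "VB": "VERB", "VBD": "VERB", "VBG": "VERB",
    "JJ": "ADJ", "JJR": "ADJ", "RB": "ADV",
    "DT": "DET", "IN": "ADP", "CC": "CCONJ", "PRP": "PRON",
}


def to_universal(tagged_tokens):
    """Convert (word, Penn tag) pairs to (word, universal tag); unknown tags become X."""
    return [(word, PTB_TO_UNIVERSAL.get(tag, "X")) for word, tag in tagged_tokens]


penn_tagged = [("Alice", "NNP"), ("ran", "VBD"), ("quickly", "RB")]
print(to_universal(penn_tagged))
# [('Alice', 'PROPN'), ('ran', 'VERB'), ('quickly', 'ADV')]
```

Collapsing tags this way loses detail (tense, number, degree), which is the trade-off for having one label set that works across languages.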
Real-World Applications
Keyword extraction (filter for nouns and proper nouns, skipping stop words):
    keywords = [
        token.text for token in doc
        if token.pos_ in ('NOUN', 'PROPN') and not token.is_stop
    ]
Sentiment-aware adjective extraction:
    sentiments = [(token.text, token.pos_) for token in doc if token.pos_ == 'ADJ']
Coreference (identify pronouns to resolve):
    pronouns = [token.text for token in doc if token.pos_ == 'PRON']
Grammar-based chunking (extract noun phrases):
    for chunk in doc.noun_chunks:
        print(chunk.text, "→", chunk.root.pos_)
POS Tagging Accuracy in 2025
Modern neural taggers built into spaCy, Stanza, and Hugging Face pipelines reach roughly 97–99% token-level accuracy on standard English corpora. Accuracy drops on:
- Domain-specific jargon (medical, legal, financial)
- Code-switching (mixed-language text)
- Social media text with non-standard grammar
- Very low-resource languages
For specialized domains, fine-tuning a small transformer model on in-domain annotated data typically produces the best results.
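Tagging accuracy is usually reported per token: the share of tokens whose predicted tag matches a hand-annotated gold tag. The sketch below shows that calculation under the assumption that gold and predicted tags are already aligned token by token; the function name and the sample tags are hypothetical.

```python
def tagging_accuracy(gold, predicted):
    """Fraction of tokens whose predicted tag equals the gold tag."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same number of tokens")
    if not gold:
        return 0.0
    correct = sum(g == p for g, p in zip(gold, predicted))
    return correct / len(gold)


# Hypothetical gold and predicted tags for a five-token sentence.
gold = ["DET", "NOUN", "VERB", "ADV", "PUNCT"]
predicted = ["DET", "NOUN", "VERB", "ADJ", "PUNCT"]
print(tagging_accuracy(gold, predicted))  # 0.8
```

Running the same measurement on a small annotated sample from your own domain is one way to check how far a general-purpose tagger falls short before deciding whether fine-tuning is worth the effort.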