Part-of-Speech Tagging in NLP

Part-of-speech (POS) tagging assigns a grammatical label — noun, verb, adjective, adverb, and so on — to each word in a sentence. It’s one of the foundational steps in understanding the structure of language.

Why POS Tagging Matters

The same word can mean different things depending on its grammatical role:

“The bank by the river” → bank is a NOUN (landform)
“I need to bank that check” → bank is a VERB (action)
“A fast car” → fast is an ADJECTIVE
“She ran fast” → fast is an ADVERB

Without POS tags, a system has no explicit signal for telling these uses apart. Downstream tasks such as lemmatization, named entity recognition, syntactic parsing, and information extraction all benefit from accurate POS labels.
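To make the lemmatization point concrete, here is a minimal sketch of a POS-aware lemma lookup. The LEMMAS table and the lemmatize helper are hypothetical and hand-written for illustration; real lemmatizers rely on much larger dictionaries and rules.

```python
# Hypothetical toy lexicon: the lemma depends on the (word, POS) pair.
LEMMAS = {
    ("saw", "VERB"): "see",
    ("saw", "NOUN"): "saw",
    ("leaves", "VERB"): "leave",
    ("leaves", "NOUN"): "leaf",
}


def lemmatize(word: str, pos: str) -> str:
    """Return the lemma for a word given its POS tag.

    Falls back to the lowercased word when the pair is not in the toy lexicon.
    """
    key = (word.lower(), pos)
    return LEMMAS.get(key, word.lower())


print(lemmatize("saw", "VERB"))     # see
print(lemmatize("saw", "NOUN"))     # saw
print(lemmatize("leaves", "NOUN"))  # leaf
print(lemmatize("River", "NOUN"))   # river (not in the toy lexicon)
```

The same surface form maps to different lemmas once the tag is known, which is why lemmatizers usually take the POS tag as an input.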

Common Tag Sets

Penn Treebank POS tags (the most widely used tag set in English NLP; a selection):

NN    Noun, singular or mass                    "dog", "city"
NNS   Noun, plural                              "dogs", "cities"
NNP   Proper noun, singular                     "London", "Alice"
VB    Verb, base form                           "run", "think"
VBD   Verb, past tense                          "ran", "thought"
VBG   Verb, gerund or present participle        "running", "thinking"
JJ    Adjective                                 "fast", "beautiful"
JJR   Adjective, comparative                    "faster", "prettier"
RB    Adverb                                    "quickly", "very"
DT    Determiner                                "the", "a", "an"
IN    Preposition or subordinating conjunction  "in", "on", "of"
CC    Coordinating conjunction                  "and", "but", "or"
PRP   Personal pronoun                          "I", "he", "they"

POS Tagging with NLTK

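import nltk
from nltk import pos_tag
from nltk.tokenize import word_tokenize

# One-time downloads: the English perceptron tagger and the Punkt tokenizer data.
# These resource names apply to recent NLTK releases; older releases use
# 'averaged_perceptron_tagger' and 'punkt' instead.
nltk.download('averaged_perceptron_tagger_eng')
nltk.download('punkt_tab')

text = "LLMs have revolutionized how developers build language-aware applications."
tokens = word_tokenize(text)  # split into word and punctuation tokens
tagged = pos_tag(tokens)      # list of (token, Penn Treebank tag) pairs

print(tagged)
# Example output (exact tags can vary with the NLTK version):
# [('LLMs', 'NNS'), ('have', 'VBP'), ('revolutionized', 'VBN'),
#  ('how', 'WRB'), ('developers', 'NNS'), ('build', 'VBP'),
#  ('language-aware', 'JJ'), ('applications', 'NNS'), ('.', '.')]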

Extract only the nouns (every noun tag starts with NN: NN, NNS, NNP, NNPS):

nouns = [word for word, tag in tagged if tag.startswith('NN')]
print(nouns)  # ['LLMs', 'developers', 'applications']
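The same prefix trick extends to other tag families. The sketch below counts tokens per family so you can see the rough grammatical makeup of a sentence. It hard-codes the tagged list from the example output above so that it runs without NLTK; the coarse helper is a hypothetical name introduced here.

```python
from collections import Counter

# Tagged tokens copied from the example output above; your tagger's output may differ.
tagged = [
    ("LLMs", "NNS"), ("have", "VBP"), ("revolutionized", "VBN"),
    ("how", "WRB"), ("developers", "NNS"), ("build", "VBP"),
    ("language-aware", "JJ"), ("applications", "NNS"), (".", "."),
]


def coarse(tag: str) -> str:
    """Collapse a Penn Treebank tag to its family (NN, VB, JJ, RB) when it has one."""
    for family in ("NN", "VB", "JJ", "RB"):
        if tag.startswith(family):
            return family
    return tag  # tags outside these families (WRB, '.', ...) are kept as-is


counts = Counter(coarse(tag) for _, tag in tagged)
print(counts["NN"], counts["VB"], counts["JJ"])  # 3 3 1
```

Note that WRB ("how") is not folded into the RB family, because the check looks at the start of the tag rather than searching inside it.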

POS Tagging with spaCy

spaCy exposes two tags per token: a fine-grained tag (token.tag_), which is the Penn Treebank tag in its English models, and a coarser universal tag (token.pos_):

# Requires the small English model: python -m spacy download en_core_web_sm
import spacy
nlp = spacy.load("en_core_web_sm")

text = "The new model released in 2025 outperforms its predecessors significantly."
doc = nlp(text)

print(f"{'Token':<15} {'POS':<8} {'Tag':<8} {'Explanation'}")
print("-" * 55)
for token in doc:
    print(f"{token.text:<15} {token.pos_:<8} {token.tag_:<8} {spacy.explain(token.tag_)}")

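# Example output (tags and explanation strings can vary by spaCy and model version):
# Token           POS      Tag      Explanation
# -------------------------------------------------------
# The             DET      DT       determiner
# new             ADJ      JJ       adjective (English), other noun-modifier (Chinese)
# model           NOUN     NN       noun, singular or mass
# released        VERB     VBN      verb, past participle
# in              ADP      IN       conjunction, subordinating or preposition
# 2025            NUM      CD       cardinal number
# outperforms     VERB     VBZ      verb, 3rd person singular present
# its             PRON     PRP$     pronoun, possessive
# predecessors    NOUN     NNS      noun, plural
# significantly   ADV      RB       adverb
# .               PUNCT    .        punctuation mark, sentence closer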

Universal POS Tags

When working with multilingual models, the Universal Dependencies (UD) POS tags offer a consistent set of 17 categories across languages:

ADJ    ADP    ADV    AUX    CCONJ  DET    INTJ   NOUN   NUM
PART   PRON   PROPN  PUNCT  SCONJ  SYM    VERB   X

spaCy’s token.pos_ returns these universal tags.
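If you only have Penn Treebank tags (for example from NLTK), a lookup table can approximate the universal tags. The sketch below is a simplified, hand-written mapping that uses the Universal Dependencies names spaCy reports (such as PROPN and CCONJ). The table PTB_TO_UPOS and the helper to_universal are hypothetical names, and the mapping ignores context-dependent cases: auxiliaries should map to AUX, and IN is sometimes SCONJ.

```python
# Simplified mapping from Penn Treebank tags to universal POS tags.
# Assumption: only common tags are covered, and context-dependent cases
# (auxiliary verbs -> AUX, IN as SCONJ) are deliberately ignored.
PTB_TO_UPOS = {
    "NN": "NOUN", "NNS": "NOUN",
    "NNP": "PROPN", "NNPS": "PROPN",
    "VB": "VERB", "VBD": "VERB", "VBG": "VERB",
    "VBN": "VERB", "VBP": "VERB", "VBZ": "VERB",
    "JJ": "ADJ", "JJR": "ADJ", "JJS": "ADJ",
    "RB": "ADV", "RBR": "ADV", "RBS": "ADV",
    "DT": "DET",
    "IN": "ADP",
    "CC": "CCONJ",
    "PRP": "PRON",
    "CD": "NUM",
    ".": "PUNCT",
}


def to_universal(ptb_tag: str) -> str:
    """Map a Penn Treebank tag to a universal tag, using X for anything unmapped."""
    return PTB_TO_UPOS.get(ptb_tag, "X")


print(to_universal("NNS"))  # NOUN
print(to_universal("NNP"))  # PROPN
print(to_universal("CC"))   # CCONJ
print(to_universal("FW"))   # X (not covered by this toy table)
```

A static table like this is lossy by design. When accuracy matters, prefer a tagger that predicts the universal tags directly.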

Real-World Applications

Keyword extraction: keep nouns and proper nouns, skipping stop words:

keywords = [token.text for token in doc
            if token.pos_ in ('NOUN', 'PROPN') and not token.is_stop]

Adjective extraction as a starting point for sentiment analysis:

sentiments = [(token.text, token.pos_) for token in doc if token.pos_ == 'ADJ']

Coreference — identify pronouns to resolve:

pronouns = [token.text for token in doc if token.pos_ == 'PRON']

Grammar-based chunking — extract noun phrases:

for chunk in doc.noun_chunks:
    print(chunk.text, "→", chunk.root.pos_)
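spaCy derives noun_chunks from its dependency parse. To show the idea behind grammar-based chunking without a parser, here is a minimal sketch that scans (word, tag) pairs for the pattern "optional determiner, any adjectives, one or more nouns". The noun_phrases function and the hand-tagged input are assumptions for illustration, not spaCy's algorithm.

```python
from typing import List, Tuple


def noun_phrases(tagged: List[Tuple[str, str]]) -> List[str]:
    """Collect runs matching: optional DT, zero or more JJ*, one or more NN*."""
    phrases = []
    i, n = 0, len(tagged)
    while i < n:
        start = i
        if tagged[i][1] == "DT":                          # optional determiner
            i += 1
        while i < n and tagged[i][1].startswith("JJ"):    # any adjectives
            i += 1
        noun_start = i
        while i < n and tagged[i][1].startswith("NN"):    # one or more nouns
            i += 1
        if i > noun_start:
            # At least one noun closed the pattern: emit the whole span.
            phrases.append(" ".join(word for word, _ in tagged[start:i]))
        elif i == start:
            # Nothing matched at this position: skip the token.
            i += 1
        # Otherwise a determiner or adjectives were consumed without a noun;
        # scanning resumes from the current position.
    return phrases


# Hand-tagged example with Penn Treebank tags (an assumption, not tagger output).
sentence = [
    ("The", "DT"), ("new", "JJ"), ("model", "NN"), ("outperforms", "VBZ"),
    ("its", "PRP$"), ("predecessors", "NNS"), ("significantly", "RB"), (".", "."),
]
print(noun_phrases(sentence))  # ['The new model', 'predecessors']
```

This simple pattern leaves out the possessive "its", which a parser-based chunker would normally attach to "predecessors". That gap is one reason to prefer parser-based chunks when a full pipeline is available.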

POS Tagging Accuracy in 2025

Modern neural taggers built into spaCy, Stanza, and Hugging Face pipelines typically reach about 97–98% token-level accuracy on standard English benchmarks. Accuracy drops on:

Domain-specific jargon (medical, legal, financial)
Code-switching (mixed-language text)
Social media text with non-standard grammar
Very low-resource languages

For specialized domains, fine-tuning a small transformer model on in-domain annotated data usually gives the largest improvement.
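Accuracy figures like the ones above are usually token-level: the share of tokens whose predicted tag matches the gold annotation. A minimal sketch of that metric, useful for checking a tagger on your own in-domain sample, might look like this. The function name and the toy tag sequences are assumptions for illustration.

```python
from typing import Sequence


def tagging_accuracy(gold: Sequence[str], predicted: Sequence[str]) -> float:
    """Return the fraction of positions where the predicted tag equals the gold tag."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same number of tags")
    if not gold:
        raise ValueError("cannot compute accuracy on an empty sequence")
    correct = sum(g == p for g, p in zip(gold, predicted))
    return correct / len(gold)


# Toy example: one of six tags is wrong (VBD predicted as VBN).
gold = ["DT", "JJ", "NN", "VBD", "IN", "CD"]
pred = ["DT", "JJ", "NN", "VBN", "IN", "CD"]
print(f"{tagging_accuracy(gold, pred):.3f}")  # 0.833
```

Token-level numbers can look better than they feel in practice. If errors were independent, 97% token accuracy would leave only about 54% of 20-token sentences tagged entirely correctly (0.97 to the power of 20).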