Chunking in NLP

Chunking groups words into meaningful multi-word phrases — noun phrases, verb phrases, prepositional phrases. It sits between basic POS tagging and full syntactic parsing: more structure than a token list, less expensive than a full parse tree.

What Is a Chunk?

A chunk is a flat phrase — one level of grouping without recursive structure. The most common target is the noun phrase (NP):

"The intelligent language model generates fluent text."
 \____________________________/           \_________/
            NP chunk                       NP chunk

Chunks tell you which entities are being discussed, which actions are described, and which modifiers attach to each head noun.
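To make "flat" concrete, here is a minimal sketch that encodes the two NP chunks from the example sentence as one label per token. It uses the common IOB convention (B = begins a chunk, I = inside a chunk, O = outside any chunk), which is an assumption of this example rather than something the text above prescribes. The `to_iob` helper and the hand-written chunk spans are hypothetical.

```python
def to_iob(tokens, chunks):
    """Label each token B-X (chunk start), I-X (inside a chunk), or O (outside)."""
    labels = ["O"] * len(tokens)
    for start, end, label in chunks:          # end index is exclusive
        labels[start] = f"B-{label}"
        for i in range(start + 1, end):
            labels[i] = f"I-{label}"
    return list(zip(tokens, labels))

tokens = ["The", "intelligent", "language", "model",
          "generates", "fluent", "text", "."]
# Hand-written (start, end, label) spans for the two NP chunks in the example
chunks = [(0, 4, "NP"), (5, 7, "NP")]

for token, label in to_iob(tokens, chunks):
    print(f"{token:<12} {label}")
```

Because every token carries exactly one label, chunks cannot nest inside each other, which is what separates chunking from a full parse tree.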

Regex-Based Chunking with NLTK

NLTK’s RegexpParser matches chunk patterns over POS tag sequences:

import nltk
from nltk import word_tokenize, pos_tag, RegexpParser

nltk.download('punkt_tab')
nltk.download('averaged_perceptron_tagger_eng')

grammar = r"""
  NP: {<DT>?<JJ.*>*<NN.*>+}     # optional determiner + adjectives + nouns
      {<NNP>+}                    # proper nouns
"""

text = "The large language model produced an impressive research paper on neural networks."
tokens = word_tokenize(text)
tagged = pos_tag(tokens)
parser = RegexpParser(grammar)
tree = parser.parse(tagged)

for subtree in tree.subtrees(filter=lambda t: t.label() == 'NP'):
    phrase = ' '.join(word for word, tag in subtree.leaves())
    print(f"NP: {phrase}")

# NP: The large language model
# NP: an impressive research paper
# NP: neural networks
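The idea of matching patterns over tag sequences can be sketched in plain Python. The version below is a simplified illustration, not NLTK's implementation: it writes the tags as one string such as `<DT><JJ><NN>`, runs an ordinary regular expression over it, and maps the matches back to words. The `chunk_np` helper is hypothetical, and the tokens are tagged by hand so that no tagger model is needed.

```python
import re

# Same shape as the first NP rule: optional determiner, adjectives, one or more nouns
NP_PATTERN = re.compile(r"(?:<DT>)?(?:<JJ[^>]*>)*(?:<NN[^>]*>)+")

def chunk_np(tagged):
    """Return NP chunks found by matching a regex over the tag sequence."""
    tag_string = "".join(f"<{tag}>" for _, tag in tagged)
    phrases = []
    for match in NP_PATTERN.finditer(tag_string):
        # Each "<" opens one tag, so counting them converts offsets to token indices
        start = tag_string.count("<", 0, match.start())
        end = start + match.group().count("<")
        phrases.append(" ".join(word for word, _ in tagged[start:end]))
    return phrases

# Hand-tagged tokens (assumed tags, written out to avoid loading a tagger)
tagged = [("The", "DT"), ("large", "JJ"), ("language", "NN"), ("model", "NN"),
          ("produced", "VBD"), ("an", "DT"), ("impressive", "JJ"),
          ("research", "NN"), ("paper", "NN"), ("on", "IN"),
          ("neural", "JJ"), ("networks", "NNS"), (".", ".")]

print(chunk_np(tagged))
# ['The large language model', 'an impressive research paper', 'neural networks']
```

The sketch handles a single rule. RegexpParser adds multiple rules, chinking, and tree output on top of the same idea.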

Chinking — Excluding Words from Chunks

Chinking is the inverse of chunking: you first chunk broadly, then specify the tag sequences to carve back out of those chunks:

grammar_with_chink = r"""
  NP:
    {<.*>+}         # Chunk everything
    }<VBD|VBP|VBZ>+{ # Chink (exclude) finite verbs
"""

text = "The developer built a scalable microservice architecture."
tokens = word_tokenize(text)
tagged = pos_tag(tokens)
parser = RegexpParser(grammar_with_chink)
result = parser.parse(tagged)
result.pretty_print()
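For this particular grammar, chunking everything and then chinking the verbs amounts to splitting the sentence wherever a chinked tag occurs. The sketch below reproduces that behavior in plain Python on hand-tagged tokens. The `chink` helper is hypothetical and is not part of NLTK. It also shows a side effect worth knowing: the sentence-final period stays inside the last chunk unless its tag is chinked as well.

```python
def chink(tagged, chink_tags):
    """Split a tagged sentence into chunks at every token whose tag is chinked."""
    chunks, current = [], []
    for word, tag in tagged:
        if tag in chink_tags:
            if current:                 # close the chunk that was being built
                chunks.append(current)
                current = []
        else:
            current.append(word)
    if current:
        chunks.append(current)
    return [" ".join(words) for words in chunks]

# Hand-tagged tokens (assumed tags) for the example sentence
tagged = [("The", "DT"), ("developer", "NN"), ("built", "VBD"), ("a", "DT"),
          ("scalable", "JJ"), ("microservice", "NN"), ("architecture", "NN"),
          (".", ".")]

finite_verbs = {"VBD", "VBP", "VBZ"}
print(chink(tagged, finite_verbs))
# ['The developer', 'a scalable microservice architecture .']
print(chink(tagged, finite_verbs | {"."}))
# ['The developer', 'a scalable microservice architecture']
```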

Noun Chunks with spaCy

spaCy extracts noun chunks out of the box, powered by the dependency parse:

import spacy
nlp = spacy.load("en_core_web_sm")

# Build the sentence as a single line so no stray newline tokens reach the parser
text = (
    "In 2025, open-source LLMs from companies like Meta and Mistral AI "
    "have become competitive alternatives to proprietary APIs for enterprise NLP tasks."
)
doc = nlp(text)

print(f"{'Chunk':<35} {'Root':<15} {'Root POS'}")
print("-" * 60)
for chunk in doc.noun_chunks:
    print(f"{chunk.text:<35} {chunk.root.text:<15} {chunk.root.pos_}")

# Typical output (exact chunks can vary with the spaCy model version):
# open-source LLMs                    LLMs            NOUN
# companies                           companies       NOUN
# Meta                                Meta            PROPN
# Mistral AI                          AI              PROPN
# competitive alternatives            alternatives    NOUN
# proprietary APIs                    APIs            NOUN
# enterprise NLP tasks                tasks           NOUN

Verb Phrase Extraction

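Chunking is not limited to noun phrases. Extracting verb phrases identifies actions and their arguments. RegexpParser applies the rules in the order they are listed, so NP and PP must be defined before the VP rule that refers to them:

grammar_vp = r"""
  NP: {<DT>?<JJ>*<NN.*>+}   # noun phrase
  PP: {<IN><NP>}            # preposition followed by a noun phrase
  VP: {<VB.*><NP|PP>*}      # verb followed by any NP or PP arguments
"""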

text = "The model can analyze sentiment and classify topics accurately."
tokens = word_tokenize(text)
tagged = pos_tag(tokens)
parser = RegexpParser(grammar_vp)
result = parser.parse(tagged)

for subtree in result.subtrees():
    if subtree.label() in ('NP', 'VP', 'PP'):
        words = ' '.join(w for w, t in subtree.leaves())
        print(f"{subtree.label()}: {words}")
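Rule order matters in a multi-stage grammar: RegexpParser runs the stages in the order they are written, so a VP rule that mentions `<NP>` only sees NP chunks if the NP stage has already run. The sketch below compares the two orderings on hand-tagged tokens (the tags are assumed, so no tagger download is needed). The `vp_phrases` helper is hypothetical.

```python
from nltk import RegexpParser

# Hand-tagged tokens (assumed tags) for the example sentence
tagged = [("The", "DT"), ("model", "NN"), ("can", "MD"), ("analyze", "VB"),
          ("sentiment", "NN"), ("and", "CC"), ("classify", "VB"),
          ("topics", "NNS"), ("accurately", "RB"), (".", ".")]

vp_first = r"""
  VP: {<VB.*><NP|PP>*}
  NP: {<DT>?<JJ>*<NN.*>+}
  PP: {<IN><NP>}
"""

np_first = r"""
  NP: {<DT>?<JJ>*<NN.*>+}
  PP: {<IN><NP>}
  VP: {<VB.*><NP|PP>*}
"""

def vp_phrases(grammar):
    """Parse the tagged sentence and return the text of every VP chunk."""
    tree = RegexpParser(grammar).parse(tagged)
    return [" ".join(word for word, _ in subtree.leaves())
            for subtree in tree.subtrees(filter=lambda t: t.label() == "VP")]

print(vp_phrases(vp_first))   # VP stage runs before any NP exists
print(vp_phrases(np_first))   # VP stage can attach the NP chunks
```

With the VP rule first, each VP contains only the bare verb. With NP and PP first, each VP also contains its noun phrase argument.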

Information Extraction Pipeline

Chunking is a core step in rule-based information extraction:

import spacy
nlp = spacy.load("en_core_web_sm")

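def extract_info(text):
    """Collect noun chunks that act as subjects or objects, based on dependency labels."""
    doc = nlp(text)
    subjects = []
    objects = []

    for chunk in doc.noun_chunks:
        # The chunk's root token carries the dependency label for the whole phrase
        if chunk.root.dep_ in ('nsubj', 'nsubjpass'):
            subjects.append(chunk.text)
        # 'dobj' is a direct object; 'pobj' is the object of a preposition
        elif chunk.root.dep_ in ('dobj', 'pobj'):
            objects.append(chunk.text)

    return {"subjects": subjects, "objects": objects}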

text = "OpenAI released GPT-5 and Google launched Gemini Ultra 2."
result = extract_info(text)
print(result)
# Expected when the sentence is parsed correctly (a small model may differ):
# {'subjects': ['OpenAI', 'Google'], 'objects': ['GPT-5', 'Gemini Ultra 2']}

Chunking vs Full Parsing

Feature	Chunking	Full Parsing
Structure	Flat phrases	Nested tree
Speed	Fast	Slower
Accuracy	High for NPs	Higher overall
Implementation	Regex or neural	Statistical or neural
Use case	Information extraction, indexing	Grammar checking, QA, translation

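Chunking is preferred when you need speed and noun phrases are the primary target. Use full dependency parsing when you need to understand grammatical roles across the whole sentence. Note that spaCy's noun_chunks are themselves derived from a dependency parse, so the speed advantage applies mainly to tagger-plus-pattern chunkers such as NLTK's RegexpParser.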