
💬 Natural Language Processing 40 guides · updated 2026

From tokenisation and embeddings to transformer-based language understanding — the NLP fundamentals that underpin every modern LLM.

Chunking in NLP

Chunking groups words into meaningful multi-word phrases — noun phrases, verb phrases, prepositional phrases. It sits between basic POS tagging and full syntactic parsing: more structure than a token list, less expensive than a full parse tree.


What Is a Chunk?

A chunk is a flat phrase — one level of grouping without recursive structure. The most common target is the noun phrase (NP):

"The intelligent language model generates fluent text."
 \____________________________/           \_________/
            NP chunk                        NP chunk

Chunks tell you which entities are being discussed, which actions are described, and which modifiers attach to each head noun.
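As a rough illustration of the last point, the sketch below splits a flat NP chunk into a head noun and its modifiers. The POS tags are hand-assigned for the example sentence, `head_and_modifiers` is a hypothetical helper, and treating the last noun as the head is a simplifying assumption that holds for many English noun phrases but not all of them.

```python
# Hand-assigned POS tags for the two NP chunks in the example sentence
# (an assumption for illustration; a real tagger may disagree).
subject_chunk = [("The", "DT"), ("intelligent", "JJ"),
                 ("language", "NN"), ("model", "NN")]
object_chunk = [("fluent", "JJ"), ("text", "NN")]

def head_and_modifiers(chunk):
    """Split a flat NP chunk into (head noun, list of modifier words).

    Heuristic: the head is the last noun-tagged token in the chunk.
    """
    noun_positions = [i for i, (_, tag) in enumerate(chunk)
                      if tag.startswith("NN")]
    if not noun_positions:
        # No noun found: report no head and treat every word as a modifier.
        return None, [word for word, _ in chunk]
    head_index = noun_positions[-1]
    head = chunk[head_index][0]
    modifiers = [word for i, (word, _) in enumerate(chunk) if i != head_index]
    return head, modifiers

print(head_and_modifiers(subject_chunk))
print(head_and_modifiers(object_chunk))
```

Because a chunk has no internal structure, a positional rule like this is about as much as you can recover; anything finer needs a parse.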


Regex-Based Chunking with NLTK

NLTK’s RegexpParser matches chunk patterns over POS tag sequences:

import nltk
from nltk import word_tokenize, pos_tag, RegexpParser

# One-time downloads: tokenizer and POS tagger models
nltk.download('punkt_tab')
nltk.download('averaged_perceptron_tagger_eng')

# Each {...} rule is a regular expression over POS tags
grammar = r"""
  NP: {<DT>?<JJ.*>*<NN.*>+}   # optional determiner + adjectives + nouns
      {<NNP>+}                # proper nouns
"""

text = "The large language model produced an impressive research paper on neural networks."
tokens = word_tokenize(text)
tagged = pos_tag(tokens)        # list of (word, tag) pairs
parser = RegexpParser(grammar)
tree = parser.parse(tagged)

# Print the words of every NP subtree
for subtree in tree.subtrees(filter=lambda t: t.label() == 'NP'):
    phrase = ' '.join(word for word, tag in subtree.leaves())
    print(f"NP: {phrase}")

# Expected output (depends on the tagger's predictions):
# NP: The large language model
# NP: an impressive research paper
# NP: neural networks
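To make "matching chunk patterns over POS tag sequences" concrete, here is a simplified sketch that uses only Python's `re` module. It is not NLTK's implementation, just the same idea in miniature: the tags are joined into one string and an ordinary regular expression is run over it. The tags are hand-assigned and `find_noun_phrases` is a hypothetical helper.

```python
import re

# Hand-assigned POS tags (an assumption; a real tagger may differ).
tagged = [("The", "DT"), ("large", "JJ"), ("language", "NN"), ("model", "NN"),
          ("produced", "VBD"), ("an", "DT"), ("impressive", "JJ"),
          ("research", "NN"), ("paper", "NN"), ("on", "IN"),
          ("neural", "JJ"), ("networks", "NNS"), (".", ".")]

# Plain-regex counterpart of the tag pattern <DT>?<JJ.*>*<NN.*>+
NP_PATTERN = re.compile(r"(?:<DT>)?(?:<JJ[^>]*>)*(?:<NN[^>]*>)+")

def find_noun_phrases(tagged_tokens):
    # Encode the tag sequence as one string, e.g. "<DT><JJ><NN>...".
    tag_string = "".join(f"<{tag}>" for _, tag in tagged_tokens)
    phrases = []
    for match in NP_PATTERN.finditer(tag_string):
        # Map character offsets back to token indices by counting "<".
        start = tag_string[:match.start()].count("<")
        end = start + match.group().count("<")
        phrases.append(" ".join(word for word, _ in tagged_tokens[start:end]))
    return phrases

noun_phrases = find_noun_phrases(tagged)
print(noun_phrases)
# ['The large language model', 'an impressive research paper', 'neural networks']
```

The words themselves never enter the match; only the tags do, which is why chunk quality depends so heavily on the tagger.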

Chinking — Excluding Words from Chunks

Chinking is the opposite of chunking — you define what to remove from a chunk:

grammar_with_chink = r"""
  NP:
    {<.*>+}             # Chunk everything (including punctuation)
    }<VBD|VBP|VBZ>+{    # Chink (exclude) finite verbs
"""

text = "The developer built a scalable microservice architecture."
tokens = word_tokenize(text)
tagged = pos_tag(tokens)
parser = RegexpParser(grammar_with_chink)
result = parser.parse(tagged)

# Draw the tree: the verb sits outside the NP chunks on either side of it
result.pretty_print()
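The effect of the chink rule can be sketched without NLTK: start from one chunk that covers the whole sentence, then cut it wherever an excluded tag appears. The tags below are hand-assigned and `chink` is a hypothetical helper written for this illustration, not an NLTK function.

```python
# Hand-assigned POS tags (an assumption; a real tagger may differ).
tagged = [("The", "DT"), ("developer", "NN"), ("built", "VBD"),
          ("a", "DT"), ("scalable", "JJ"), ("microservice", "NN"),
          ("architecture", "NN"), (".", ".")]

FINITE_VERB_TAGS = {"VBD", "VBP", "VBZ"}

def chink(tagged_tokens, excluded_tags):
    """Treat the sentence as one chunk, then cut it at every excluded tag."""
    chunks, current = [], []
    for word, tag in tagged_tokens:
        if tag in excluded_tags:
            # A chinked token closes the current chunk and joins no chunk.
            if current:
                chunks.append(current)
            current = []
        else:
            current.append(word)
    if current:
        chunks.append(current)
    return chunks

chunks = chink(tagged, FINITE_VERB_TAGS)
print(chunks)
# [['The', 'developer'], ['a', 'scalable', 'microservice', 'architecture', '.']]
```

Note that the final period stays inside the second chunk, because a chunk-everything rule does not distinguish punctuation from words. Adding the punctuation tag to the excluded set would remove it.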

Noun Chunks with spaCy

spaCy extracts noun chunks out of the box, powered by the dependency parse:

import spacy

# Requires the small English model: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

text = """
In 2025, open-source LLMs from companies like Meta and Mistral AI
have become competitive alternatives to proprietary APIs for enterprise NLP tasks.
"""
doc = nlp(text)

# Each noun chunk is a Span; its root is the head word of the phrase
print(f"{'Chunk':<35} {'Root':<15} {'Root POS'}")
print("-" * 60)
for chunk in doc.noun_chunks:
    print(f"{chunk.text:<35} {chunk.root.text:<15} {chunk.root.pos_}")

# Typical output (exact chunks depend on the model version):
# open-source LLMs                    LLMs            NOUN
# companies                           companies       NOUN
# Meta                                Meta            PROPN
# Mistral AI                          AI              PROPN
# competitive alternatives            alternatives    NOUN
# proprietary APIs                    APIs            NOUN
# enterprise NLP tasks                tasks           NOUN

Verb Phrase Extraction

Chunking is not limited to noun phrases. Extracting verb phrases identifies actions and their arguments:

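# RegexpParser applies the rules top to bottom, so NP and PP must be
# built before the VP rule that refers to them.
grammar_vp = r"""
  NP: {<DT>?<JJ>*<NN.*>+}     # noun phrase
  PP: {<IN><NP>}              # preposition + noun phrase
  VP: {<MD>?<VB.*><NP|PP>*}   # optional modal + verb + its NP/PP arguments
"""

text = "The model can analyze sentiment and classify topics accurately."
tokens = word_tokenize(text)
tagged = pos_tag(tokens)
parser = RegexpParser(grammar_vp)
result = parser.parse(tagged)

# Walk every subtree; an NP nested inside a VP is printed as well
for subtree in result.subtrees():
    if subtree.label() in ('NP', 'VP', 'PP'):
        words = ' '.join(w for w, t in subtree.leaves())
        print(f"{subtree.label()}: {words}")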
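Rule order matters in a multi-stage grammar like this one: `RegexpParser` runs one stage per label, top to bottom, so a rule can only refer to labels that an earlier stage has already produced. The sketch below runs the same three rules in two different orders over hand-tagged tokens (the tags are an assumption, chosen so that no tagger download is needed) and compares the verb phrases each ordering finds. `verb_phrases` is a hypothetical helper.

```python
from nltk import RegexpParser

# Hand-assigned POS tags (an assumption; a real tagger may differ).
tagged = [("The", "DT"), ("model", "NN"), ("can", "MD"), ("analyze", "VB"),
          ("sentiment", "NN"), ("and", "CC"), ("classify", "VB"),
          ("topics", "NNS"), ("accurately", "RB"), (".", ".")]

# Same rules, different order.
vp_first_grammar = r"""
  VP: {<VB.*><NP|PP>*}
  NP: {<DT>?<JJ>*<NN.*>+}
  PP: {<IN><NP>}
"""
np_first_grammar = r"""
  NP: {<DT>?<JJ>*<NN.*>+}
  PP: {<IN><NP>}
  VP: {<VB.*><NP|PP>*}
"""

def verb_phrases(grammar, tagged_tokens):
    """Return the text of every VP chunk the grammar produces."""
    tree = RegexpParser(grammar).parse(tagged_tokens)
    return [" ".join(word for word, _ in vp.leaves())
            for vp in tree.subtrees(filter=lambda t: t.label() == "VP")]

vp_first = verb_phrases(vp_first_grammar, tagged)
np_first = verb_phrases(np_first_grammar, tagged)
print(vp_first)  # ['analyze', 'classify']
print(np_first)  # ['analyze sentiment', 'classify topics']
```

With VP listed first, no NP chunks exist yet when the VP stage runs, so each verb is chunked on its own and loses its object.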

Information Extraction Pipeline

Chunking is a core step in rule-based information extraction:

import spacy

nlp = spacy.load("en_core_web_sm")

def extract_info(text):
    """Sort noun chunks into subjects and objects by the dependency label of their root."""
    doc = nlp(text)
    subjects = []
    objects = []
    for chunk in doc.noun_chunks:
        if chunk.root.dep_ in ('nsubj', 'nsubjpass'):
            subjects.append(chunk.text)
        # 'pobj' is the object of a preposition, so this also collects
        # noun phrases such as "Paris" in "in Paris"
        elif chunk.root.dep_ in ('dobj', 'pobj'):
            objects.append(chunk.text)
    return {"subjects": subjects, "objects": objects}

text = "OpenAI released GPT-5 and Google launched Gemini Ultra 2."
result = extract_info(text)
print(result)
# Typical output (depends on the model's parse):
# {'subjects': ['OpenAI', 'Google'], 'objects': ['GPT-5', 'Gemini Ultra 2']}

Chunking vs Full Parsing

Feature          Chunking                          Full Parsing
---------------  --------------------------------  ---------------------------------
Structure        Flat phrases                      Nested tree
Speed            Fast                              Slower
Accuracy         High for NPs                      Higher overall
Implementation   Regex or neural                   Statistical or neural
Use case         Information extraction, indexing  Grammar checking, QA, translation

Chunking is preferred when you need speed and noun phrases are the primary target. Use full dependency parsing when you need to understand grammatical roles across the whole sentence.
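The structural difference can be sketched with two hand-written bracketings of the same sentence. Both trees are illustrative assumptions typed in by hand, not the output of a chunker or parser, and `verb_phrases` is a hypothetical helper.

```python
from nltk import Tree

# Flat chunk structure: NP chunks sit next to the verb as siblings.
chunked = Tree.fromstring("(S (NP The model) generates (NP fluent text))")

# Nested parse: the object NP sits inside the VP, so the tree itself
# records which noun phrase the verb acts on.
parsed = Tree.fromstring(
    "(S (NP (DT The) (NN model))"
    " (VP (VBZ generates) (NP (JJ fluent) (NN text))))"
)

def verb_phrases(tree):
    """Return the text of every VP node in the tree."""
    return [" ".join(vp.leaves())
            for vp in tree.subtrees(filter=lambda t: t.label() == "VP")]

print(verb_phrases(chunked))  # []
print(verb_phrases(parsed))   # ['generates fluent text']
```

The chunked tree tells you which noun phrases occur, while the nested tree also tells you that "fluent text" is what gets generated. That extra relation is what the slower full parse buys.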