Chunking in NLP
Chunking groups words into meaningful multi-word phrases — noun phrases, verb phrases, prepositional phrases. It sits between basic POS tagging and full syntactic parsing: more structure than a token list, less expensive than a full parse tree.
What Is a Chunk?
A chunk is a flat phrase — one level of grouping without recursive structure. The most common target is the noun phrase (NP):
"The intelligent language model generates fluent text." \_________________________/ \_________/ NP chunk NP chunkChunks give you what entities are being discussed, what actions are described, and which modifiers attach to each head noun.
Regex-Based Chunking with NLTK
NLTK’s RegexpParser matches chunk patterns over POS tag sequences:
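To make "patterns over POS tag sequences" concrete, here is a rough standard-library sketch of the idea: the tags are joined into a string such as `<DT><JJ><NN>` and an ordinary regular expression is run over it. This illustrates the concept only and is not NLTK's implementation; the hand-tagged tokens, `NP_PATTERN`, and the `np_chunks` helper are assumptions made for the example.

```python
import re

# Hand-tagged tokens (Penn Treebank tags), so no tagger download is needed.
tagged = [
    ("The", "DT"), ("large", "JJ"), ("language", "NN"), ("model", "NN"),
    ("produced", "VBD"),
    ("an", "DT"), ("impressive", "JJ"), ("research", "NN"), ("paper", "NN"),
]

# NP pattern: optional determiner, any adjectives, one or more nouns.
# Each tag is wrapped in <...> so the regex can only match whole tags.
NP_PATTERN = re.compile(r"(?:<DT>)?(?:<JJ[^>]*>)*(?:<NN[^>]*>)+")


def np_chunks(tagged):
    """Return NP chunks as lists of words (hypothetical helper)."""
    tag_string = "".join(f"<{tag}>" for _, tag in tagged)
    chunks = []
    for match in NP_PATTERN.finditer(tag_string):
        # Count '<' before the match to turn a character offset into a token index.
        start = tag_string.count("<", 0, match.start())
        end = start + match.group().count("<")
        chunks.append([word for word, _ in tagged[start:end]])
    return chunks


print(np_chunks(tagged))
```

The NLTK code that follows expresses the same noun-phrase pattern as a `RegexpParser` grammar and gets its tags from a real tagger.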
```python
import nltk
from nltk import word_tokenize, pos_tag, RegexpParser

# One-time downloads: tokenizer and POS tagger models
nltk.download('punkt_tab')
nltk.download('averaged_perceptron_tagger_eng')

grammar = r"""
    NP: {<DT>?<JJ.*>*<NN.*>+}   # optional determiner + adjectives + nouns
        {<NNP>+}                # proper nouns (<NN.*> above already matches NNP)
"""

text = "The large language model produced an impressive research paper on neural networks."
tokens = word_tokenize(text)
tagged = pos_tag(tokens)
parser = RegexpParser(grammar)
tree = parser.parse(tagged)

# Print every NP subtree as a plain phrase
for subtree in tree.subtrees(filter=lambda t: t.label() == 'NP'):
    phrase = ' '.join(word for word, tag in subtree.leaves())
    print(f"NP: {phrase}")

# NP: The large language model
# NP: an impressive research paper
# NP: neural networks
```

Chinking: Excluding Words from Chunks
Chinking is the opposite of chunking — you define what to remove from a chunk:
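Before the NLTK grammar, the effect of a chink can be sketched in plain Python: treat the whole sentence as one chunk, then cut it wherever a chinked tag appears. The hand-tagged tokens, `CHINK_TAGS`, and the `chink` helper are assumptions made for this illustration (the final period is left out for brevity); none of them are part of NLTK.

```python
# Hand-tagged tokens (Penn Treebank tags); a tagger would normally supply these.
tagged = [
    ("The", "DT"), ("developer", "NN"), ("built", "VBD"),
    ("a", "DT"), ("scalable", "JJ"), ("microservice", "NN"), ("architecture", "NN"),
]

CHINK_TAGS = {"VBD", "VBP", "VBZ"}  # finite verbs to cut out of the chunk


def chink(tagged, chink_tags):
    """Start from one chunk covering everything, then split it at chinked tags."""
    chunks, current = [], []
    for word, tag in tagged:
        if tag in chink_tags:
            # A chinked token closes the chunk built so far and is itself dropped.
            if current:
                chunks.append(current)
            current = []
        else:
            current.append(word)
    if current:
        chunks.append(current)
    return chunks


print(chink(tagged, CHINK_TAGS))
```

In NLTK the same exclusion is written as a `}...{` rule inside the grammar, as in the code that follows.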
```python
from nltk import word_tokenize, pos_tag, RegexpParser

grammar_with_chink = r"""
    NP: {<.*>+}               # Chunk everything
        }<VBD|VBP|VBZ>+{      # Chink (exclude) finite verbs
"""

text = "The developer built a scalable microservice architecture."
tokens = word_tokenize(text)
tagged = pos_tag(tokens)
parser = RegexpParser(grammar_with_chink)
result = parser.parse(tagged)
result.pretty_print()
```

Noun Chunks with spaCy
spaCy extracts noun chunks out of the box, powered by the dependency parse:
```python
import spacy

nlp = spacy.load("en_core_web_sm")

text = """In 2025, open-source LLMs from companies like Meta and Mistral AI
have become competitive alternatives to proprietary APIs for enterprise NLP tasks."""
doc = nlp(text)

print(f"{'Chunk':<35} {'Root':<15} {'Root POS'}")
print("-" * 60)
for chunk in doc.noun_chunks:
    print(f"{chunk.text:<35} {chunk.root.text:<15} {chunk.root.pos_}")

# open-source LLMs                    LLMs            NOUN
# companies                           companies       NOUN
# Meta                                Meta            PROPN
# Mistral AI                          AI              PROPN
# competitive alternatives            alternatives    NOUN
# proprietary APIs                    APIs            NOUN
# enterprise NLP tasks                tasks           NOUN
```

Verb Phrase Extraction
Chunking is not limited to noun phrases. Extracting verb phrases identifies actions and their arguments:
```python
from nltk import word_tokenize, pos_tag, RegexpParser

# Stages run top to bottom, so NP and PP must be built
# before the VP rule can refer to them.
grammar_vp = r"""
    NP: {<DT>?<JJ>*<NN.*>+}
    PP: {<IN><NP>}
    VP: {<VB.*><NP|PP>*}
"""

text = "The model can analyze sentiment and classify topics accurately."
tokens = word_tokenize(text)
tagged = pos_tag(tokens)
parser = RegexpParser(grammar_vp)
result = parser.parse(tagged)

# Print each phrase, including NPs nested inside a VP
for subtree in result.subtrees():
    if subtree.label() in ('NP', 'VP', 'PP'):
        words = ' '.join(w for w, t in subtree.leaves())
        print(f"{subtree.label()}: {words}")
```

Information Extraction Pipeline
Chunking is a core step in rule-based information extraction:
```python
import spacy

nlp = spacy.load("en_core_web_sm")

def extract_info(text):
    doc = nlp(text)
    subjects = []
    objects = []

    # Sort noun chunks by the dependency label of their root token
    for chunk in doc.noun_chunks:
        if chunk.root.dep_ in ('nsubj', 'nsubjpass'):
            subjects.append(chunk.text)
        elif chunk.root.dep_ in ('dobj', 'pobj'):
            objects.append(chunk.text)

    return {"subjects": subjects, "objects": objects}

text = "OpenAI released GPT-5 and Google launched Gemini Ultra 2."
result = extract_info(text)
print(result)
# {'subjects': ['OpenAI', 'Google'], 'objects': ['GPT-5', 'Gemini Ultra 2']}
```

Chunking vs Full Parsing

| Feature | Chunking | Full Parsing |
|---|---|---|
| Structure | Flat phrases | Nested tree |
| Speed | Fast | Slower |
| Accuracy | High for NPs | Higher overall |
| Implementation | Regex or neural | Statistical or neural |
| Use case | Information extraction, indexing | Grammar checking, QA, translation |
Chunking is preferred when you need speed and noun phrases are the primary target. Use full dependency parsing when you need to understand grammatical roles across the whole sentence.
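The structural difference can be made concrete with a small sketch that measures nesting depth. Both trees below are hand-written for illustration, not tool output, and the `(label, children)` tuple encoding and the `depth` helper are assumptions of this example.

```python
# Trees as (label, children) tuples; leaves are plain strings.
# Chunker-style output: every phrase hangs directly off the sentence node.
chunked = ("S", [("NP", ["The", "model"]), "generates", ("NP", ["fluent", "text"])])

# Hand-written constituency parse of the same sentence (illustrative only).
parsed = ("S", [
    ("NP", ["The", "model"]),
    ("VP", ["generates", ("NP", ["fluent", "text"])]),
])


def depth(node):
    """Number of phrase levels above the deepest word."""
    if isinstance(node, str):
        return 0
    _, children = node
    return 1 + max(depth(child) for child in children)


print(depth(chunked))  # 2
print(depth(parsed))   # 3
```

A single-stage chunker stays at depth 2 no matter how long the sentence is, while a full parse adds a level for each embedded phrase or clause. That extra structure is where the added cost, and the added information about grammatical roles, comes from.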