💬 Natural Language Processing 40 guides · updated 2026

From tokenisation and embeddings to transformer-based language understanding — the NLP fundamentals that underpin every modern LLM.

Syntax in NLP

Syntax is the study of how words combine to form phrases and sentences. In NLP, syntactic analysis gives a machine a structured view of a sentence — which words are heads, which are modifiers, and how they all fit together.


Two Views of Sentence Structure

NLP uses two primary frameworks for representing syntax:

Constituency (Phrase Structure) — sentences are nested phrases (NP, VP, PP). Focus is on grouping.

Dependency — words link to each other in head-modifier relationships. Focus is on grammatical function.

Sentence: "The fast robot processed the data."
Constituency:
[S [NP The fast robot] [VP processed [NP the data]]]
Dependency (head → dependent):
processed → robot (nsubj)
processed → data (dobj)
robot → The (det)
robot → fast (amod)
data → the (det)
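The two views are closely related: the words a head dominates in the dependency analysis often line up with a constituent. The sketch below stores the edges from the example as plain tuples and walks them. The names `EDGES`, `find_root`, and `subtree_words` are made up for this illustration, and keying words by their string form only works here because no word form repeats in the sentence.

```python
# Each edge is (head, dependent, label), copied from the example above.
EDGES = [
    ("processed", "robot", "nsubj"),
    ("processed", "data", "dobj"),
    ("robot", "The", "det"),
    ("robot", "fast", "amod"),
    ("data", "the", "det"),
]


def find_root(edges):
    """Return the one word that heads other words but has no head itself."""
    heads = {head for head, _, _ in edges}
    dependents = {dep for _, dep, _ in edges}
    roots = heads - dependents
    if len(roots) != 1:
        raise ValueError(f"expected exactly one root, found {sorted(roots)}")
    return next(iter(roots))


def subtree_words(edges, head):
    """Return head plus every word it dominates, directly or indirectly."""
    words = [head]
    for h, dep, _ in edges:
        if h == head:
            words.extend(subtree_words(edges, dep))
    return words


print(find_root(EDGES))               # the main verb
print(subtree_words(EDGES, "robot"))  # words of the subject noun phrase
print(subtree_words(EDGES, "data"))   # words of the object noun phrase
```

The subtree under "robot" contains the same words as the constituent [NP The fast robot], which is why dependency trees and phrase-structure trees can often be converted into one another.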

Context-Free Grammars (CFG)

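A context-free grammar (CFG) is a set of rewrite rules, each of which expands a single nonterminal (such as NP) into a sequence of nonterminals and words. A sentence is grammatical if it can be derived from the start symbol S: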

import nltk
grammar = nltk.CFG.fromstring("""
S -> NP VP
NP -> DT JJ NN | DT NN | NN
VP -> VBD NP | VBD
DT -> 'The' | 'the' | 'a'
JJ -> 'fast' | 'large'
NN -> 'robot' | 'data' | 'model'
VBD -> 'processed' | 'analyzed'
""")
parser = nltk.ChartParser(grammar)
sentence = "The fast robot processed the data".split()
for tree in parser.parse(sentence):
    tree.pretty_print()
    tree.draw()  # opens a GUI window

Output tree structure (schematic):

                  S
        __________|__________
       NP                  VP
  ______|______       ______|______
 DT    JJ    NN      VBD         NP
  |     |     |       |        ___|___
 The   fast robot processed   DT    NN
                               |     |
                              the   data
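To see how the rules license a sentence, it can help to expand them by hand. The sketch below is a naive top-down parser in plain Python for the same toy grammar. It is an illustration only, not how nltk.ChartParser works internally (a chart parser reuses partial results, while this one simply tries every alternative), and the names `GRAMMAR`, `expand`, `parse`, and `bracket` are hypothetical.

```python
# Toy grammar mirroring the NLTK example: nonterminal -> list of alternatives.
GRAMMAR = {
    "S": [["NP", "VP"]],
    "NP": [["DT", "JJ", "NN"], ["DT", "NN"], ["NN"]],
    "VP": [["VBD", "NP"], ["VBD"]],
    "DT": [["The"], ["the"], ["a"]],
    "JJ": [["fast"], ["large"]],
    "NN": [["robot"], ["data"], ["model"]],
    "VBD": [["processed"], ["analyzed"]],
}


def expand(symbol, tokens, start):
    """Yield (tree, end) for each way `symbol` can derive tokens[start:end]."""
    if symbol not in GRAMMAR:
        # Terminal: it must equal the next input token.
        if start < len(tokens) and tokens[start] == symbol:
            yield symbol, start + 1
        return
    for rhs in GRAMMAR[symbol]:
        # Match the right-hand side left to right, keeping every partial match.
        partial = [([], start)]
        for sym in rhs:
            partial = [
                (children + [tree], end)
                for children, pos in partial
                for tree, end in expand(sym, tokens, pos)
            ]
        for children, end in partial:
            yield (symbol, children), end


def parse(tokens):
    """Return every S tree that covers the whole token list."""
    return [tree for tree, end in expand("S", tokens, 0) if end == len(tokens)]


def bracket(tree):
    """Render a tree in the bracket notation used earlier in this guide."""
    if isinstance(tree, str):
        return tree
    label, children = tree
    return "[" + label + " " + " ".join(bracket(c) for c in children) + "]"


for tree in parse("The fast robot processed the data".split()):
    print(bracket(tree))
```

This approach only terminates because the toy grammar has no left-recursive rules (for example NP -> NP PP); handling those is one reason practical parsers use chart-based algorithms instead.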

Noun Phrase Chunking with NLTK

Chunking (shallow parsing) matches regular-expression patterns over POS tags to extract flat phrases without building a full parse tree:

import nltk
from nltk import pos_tag, word_tokenize, RegexpParser
nltk.download('averaged_perceptron_tagger_eng')
nltk.download('punkt_tab')
chunk_grammar = r"""
NP: {<DT>?<JJ>*<NN.*>+} # Noun phrase
VP: {<VB.*><NP>?} # Verb phrase
"""
text = "The advanced language model processed a large dataset efficiently."
tokens = word_tokenize(text)
tagged = pos_tag(tokens)
parser = RegexpParser(chunk_grammar)
result = parser.parse(tagged)
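for subtree in result.subtrees():
    if subtree.label() in ('NP', 'VP'):
        print(f"{subtree.label()}: {' '.join(w for w, t in subtree.leaves())}")
# Typical output (exact chunks depend on the tags the tagger assigns).
# The VP rule runs after the NP rule, so the VP absorbs the following NP,
# which is then also printed as a nested subtree:
# NP: The advanced language model
# VP: processed a large dataset
# NP: a large dataset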
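The idea behind a tag pattern such as `{<DT>?<JJ>*<NN.*>+}` can be reproduced with Python's built-in `re` module: write the tag sequence as one string, run an ordinary regex over it, and map each match back to the words. The sketch below is a simplified illustration of that idea, not NLTK's implementation. The POS tags are assigned by hand (an assumption; a real tagger may differ), and `np_chunks` is a hypothetical helper.

```python
import re

# Hand-assigned POS tags (an assumption; a real tagger may label words differently).
TAGGED = [
    ("The", "DT"), ("advanced", "JJ"), ("language", "NN"), ("model", "NN"),
    ("processed", "VBD"), ("a", "DT"), ("large", "JJ"), ("dataset", "NN"),
    ("efficiently", "RB"), (".", "."),
]

# Optional determiner, any number of adjectives, one or more noun tags.
NP_PATTERN = re.compile(r"(?:<DT>)?(?:<JJ>)*(?:<NN[^>]*>)+")


def np_chunks(tagged):
    """Return noun-phrase chunks found by matching NP_PATTERN over the tags."""
    tag_string = "".join(f"<{tag}>" for _, tag in tagged)
    # offsets[i] is the character position where token i's tag begins.
    offsets, pos = [], 0
    for _, tag in tagged:
        offsets.append(pos)
        pos += len(tag) + 2  # the tag plus its angle brackets
    chunks = []
    for match in NP_PATTERN.finditer(tag_string):
        first = offsets.index(match.start())
        last = first + match.group().count("<")  # one "<" per matched tag
        chunks.append(" ".join(word for word, _ in tagged[first:last]))
    return chunks


print(np_chunks(TAGGED))
```

Because the pattern only ever sees tags, its quality is bounded by the tagger: a mis-tagged word silently breaks or merges chunks.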

Syntactic Analysis with spaCy

spaCy provides full syntactic analysis in a single pipeline call:

import spacy
nlp = spacy.load("en_core_web_sm")
text = "The startup launched its new AI product at the conference in San Francisco."
doc = nlp(text)
# Noun chunks: flat (non-nested) base noun phrases
print("Noun chunks:")
for chunk in doc.noun_chunks:
    print(f" {chunk.text:<30} root: {chunk.root.text}")
# Syntactic relations: each token points to its head
print("\nDependencies:")
for token in doc:
    print(f" {token.text:<15} --[{token.dep_}]--> {token.head.text}")
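A common use of these head and label attributes is pulling out who-did-what triples. The sketch below is one minimal way to do that, assuming the `nsubj` and `dobj` labels that spaCy's English models emit. `subject_verb_object` and `make_tokens` are hypothetical helpers, and the parse in `ROWS` is written by hand so the example runs without a model; the stand-in objects only mimic the `i`, `text`, `dep_`, and `head` attributes of spaCy tokens.

```python
from types import SimpleNamespace


def subject_verb_object(tokens):
    """Return (subject, verb, object) triples for verbs that have both."""
    subjects, objects, verbs = {}, {}, {}
    for tok in tokens:
        if tok.dep_ == "nsubj":
            subjects[tok.head.i] = tok.text
            verbs[tok.head.i] = tok.head.text
        elif tok.dep_ == "dobj":
            objects[tok.head.i] = tok.text
            verbs[tok.head.i] = tok.head.text
    return [
        (subjects[i], verbs[i], objects[i])
        for i in sorted(verbs)
        if i in subjects and i in objects
    ]


def make_tokens(rows):
    """Build stand-ins for spaCy tokens from (text, dep, head_index) rows."""
    toks = [
        SimpleNamespace(i=i, text=text, dep_=dep, head=None)
        for i, (text, dep, _) in enumerate(rows)
    ]
    for tok, (_, _, head_index) in zip(toks, rows):
        tok.head = toks[head_index]
    return toks


# Hand-written parse of "The startup launched its new AI product"
# (an assumption; a real model's analysis may differ in detail).
ROWS = [
    ("The", "det", 1), ("startup", "nsubj", 2), ("launched", "ROOT", 2),
    ("its", "poss", 6), ("new", "amod", 6), ("AI", "compound", 6),
    ("product", "dobj", 2),
]

print(subject_verb_object(make_tokens(ROWS)))
```

With a loaded pipeline, the same function should accept a real document directly, as in `subject_verb_object(nlp(text))`, since it only reads attributes that spaCy tokens provide.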

Visualizing Syntax Trees

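import spacy
from spacy import displacy
nlp = spacy.load("en_core_web_sm")
doc = nlp("Researchers published an influential paper on transformer models.")
# In a Jupyter notebook, render the dependency arcs inline:
displacy.render(doc, style="dep", jupyter=True)
# In a script, start a local web server instead:
# displacy.serve(doc, style="dep")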

Why Syntax Still Matters

In the era of transformer models (2025), raw syntax trees are less commonly used as explicit features. But syntactic knowledge shapes:


Formal Grammars vs Neural Parsers

Approach | Accuracy | Speed | Flexibility
Rule-based CFG | Medium | Fast | Low
Statistical parsers (2010s) | Good | Medium | Medium
Neural parsers (spaCy, Stanza) | High | Fast | High
Fine-tuned BERT parsers | Very high | Moderate | Very high

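Modern neural parsers trained on treebanks such as the Penn Treebank or Universal Dependencies report labeled attachment scores (LAS) of roughly 95% or higher on standard English benchmarks, although small general-purpose pipeline models typically score lower.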
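Labeled attachment score counts a token as correct only when both its predicted head and its dependency label match the gold annotation; the unlabeled variant (UAS) checks the head alone. A minimal sketch of the computation follows. The function name `attachment_scores` and both analyses are made up for illustration, with -1 used as an arbitrary marker for the root's head.

```python
def attachment_scores(gold, predicted):
    """Return (UAS, LAS) for two parses given as lists of (head_index, label)."""
    if not gold or len(gold) != len(predicted):
        raise ValueError("parses must be non-empty and of equal length")
    # UAS: the head matches. LAS: the head and the label both match.
    head_hits = sum(g[0] == p[0] for g, p in zip(gold, predicted))
    full_hits = sum(g == p for g, p in zip(gold, predicted))
    return head_hits / len(gold), full_hits / len(gold)


# "The fast robot processed the data", one (head_index, label) pair per token.
GOLD = [(2, "det"), (2, "amod"), (3, "nsubj"), (-1, "ROOT"), (5, "det"), (3, "dobj")]
# Invented parser output with one wrong label ("fast") and one wrong head ("the").
PRED = [(2, "det"), (2, "det"), (3, "nsubj"), (-1, "ROOT"), (3, "det"), (3, "dobj")]

uas, las = attachment_scores(GOLD, PRED)
print(f"UAS = {uas:.3f}, LAS = {las:.3f}")
```

Published evaluations differ in details such as whether punctuation tokens are counted, so scores from different papers or tools are not always directly comparable.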