Syntax in NLP

Syntax is the study of how words combine to form phrases and sentences. In NLP, syntactic analysis gives a machine a structured view of a sentence — which words are heads, which are modifiers, and how they all fit together.

Two Views of Sentence Structure

NLP uses two primary frameworks for representing syntax:

Constituency (Phrase Structure) — sentences are nested phrases (NP, VP, PP). Focus is on grouping.

Dependency — words link to each other in head-modifier relationships. Focus is on grammatical function.

Sentence: "The fast robot processed the data."

Constituency:
  [S [NP The fast robot] [VP processed [NP the data]]]

Dependency (head → dependent, relation label in parentheses):
  processed → robot  (nsubj)
  processed → data   (dobj)
  robot → The        (det)
  robot → fast       (amod)
  data → the         (det)
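
The dependency listing above can also be held as plain data. The sketch below stores the same five edges as (head, dependent, label) triples over token indices and recovers the root and the words under any head. The names `tokens`, `edges`, `find_root`, and `subtree` are illustrative and not part of any library. Note that the words under "robot" are exactly the NP constituent from the bracketed analysis, which is how the two views relate.

```python
# Tokens of the example sentence; edges use positions in this list.
tokens = ["The", "fast", "robot", "processed", "the", "data"]

# (head_index, dependent_index, label), copied from the listing above.
edges = [
    (3, 2, "nsubj"),  # processed -> robot
    (3, 5, "dobj"),   # processed -> data
    (2, 0, "det"),    # robot -> The
    (2, 1, "amod"),   # robot -> fast
    (5, 4, "det"),    # data -> the
]


def find_root(edges):
    """Return the one index that is a head but never a dependent."""
    dependents = [d for _, d, _ in edges]
    if len(dependents) != len(set(dependents)):
        raise ValueError("a token has more than one head")
    roots = {h for h, _, _ in edges} - set(dependents)
    if len(roots) != 1:
        raise ValueError("expected exactly one root")
    return roots.pop()


def subtree(edges, head):
    """Indices of `head` and everything below it, in sentence order."""
    found = {head}
    stack = [head]
    while stack:
        node = stack.pop()
        for h, d, _ in edges:
            if h == node and d not in found:
                found.add(d)
                stack.append(d)
    return sorted(found)


print(tokens[find_root(edges)])                       # processed
print(" ".join(tokens[i] for i in subtree(edges, 2)))  # The fast robot
```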

Context-Free Grammars (CFG)

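A context-free grammar (CFG) is a set of production rules, each stating how one phrase type can be rewritten as a sequence of smaller phrases and words. The toy grammar below covers the example sentence and is parsed with NLTK's chart parser: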

import nltk

grammar = nltk.CFG.fromstring("""
  S  -> NP VP
  NP -> DT JJ NN | DT NN | NN
  VP -> VBD NP | VBD
  DT -> 'The' | 'the' | 'a'
  JJ -> 'fast' | 'large'
  NN -> 'robot' | 'data' | 'model'
  VBD -> 'processed' | 'analyzed'
""")

parser = nltk.ChartParser(grammar)
sentence = "The fast robot processed the data".split()

for tree in parser.parse(sentence):
    tree.pretty_print()  # ASCII rendering in the terminal
    tree.draw()  # opens a GUI window (needs Tkinter and a display; blocks until closed)

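Output tree structure (schematic; the exact spacing printed by pretty_print may differ):

               S
      _________|_________
     NP                 VP
  ____|_____       _____|______
 DT  JJ    NN     VBD        NP
 |   |     |       |      ___|___
 |   |     |       |     DT      NN
 |   |     |       |      |      |
The fast robot processed the    data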
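
To make the idea of "rules licensing a sentence" concrete without NLTK, here is a minimal sketch of a recognizer for the same toy grammar. It is a memoized top-down search over token spans, not the algorithm NLTK's ChartParser uses, and it assumes a grammar with no empty productions and no unary cycles. The names `GRAMMAR`, `recognizes`, `derives`, and `matches` are illustrative.

```python
from functools import lru_cache

# The toy grammar from above as a dict: nonterminal -> list of right-hand sides.
# Any symbol that is not a key is treated as a terminal (a literal word).
GRAMMAR = {
    "S": [("NP", "VP")],
    "NP": [("DT", "JJ", "NN"), ("DT", "NN"), ("NN",)],
    "VP": [("VBD", "NP"), ("VBD",)],
    "DT": [("The",), ("the",), ("a",)],
    "JJ": [("fast",), ("large",)],
    "NN": [("robot",), ("data",), ("model",)],
    "VBD": [("processed",), ("analyzed",)],
}


def recognizes(tokens, start="S"):
    """Return True if `start` can derive the whole token sequence."""
    tokens = tuple(tokens)

    @lru_cache(maxsize=None)
    def derives(symbol, i, j):
        """True if `symbol` can derive exactly tokens[i:j]."""
        if symbol not in GRAMMAR:  # terminal: must match one token
            return j == i + 1 and tokens[i] == symbol
        return any(matches(rhs, i, j) for rhs in GRAMMAR[symbol])

    @lru_cache(maxsize=None)
    def matches(rhs, i, j):
        """True if the symbol sequence `rhs` can derive exactly tokens[i:j]."""
        if len(rhs) == 1:
            return derives(rhs[0], i, j)
        # First symbol covers tokens[i:k]; the remaining symbols cover tokens[k:j].
        # Each remaining symbol needs at least one token, which bounds k.
        return any(
            derives(rhs[0], i, k) and matches(rhs[1:], k, j)
            for k in range(i + 1, j - len(rhs) + 2)
        )

    return len(tokens) > 0 and derives(start, 0, len(tokens))


print(recognizes("The fast robot processed the data".split()))  # True
print(recognizes("The robot fast processed".split()))           # False
```

A recognizer only answers yes or no. A parser such as ChartParser additionally records which rules were used for each span so that the trees can be rebuilt.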

Noun Phrase Chunking with NLTK

Chunking (shallow parsing) matches regular-expression patterns over sequences of POS tags to pull out phrases without building a full parse tree:

import nltk
from nltk import pos_tag, word_tokenize, RegexpParser

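# Resource names used by recent NLTK releases; older releases use
# 'averaged_perceptron_tagger' and 'punkt' instead.
nltk.download('averaged_perceptron_tagger_eng')
nltk.download('punkt_tab')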

chunk_grammar = r"""
  NP: {<DT>?<JJ>*<NN.*>+}   # Noun phrase
  VP: {<VB.*><NP>?}          # Verb phrase
"""

text = "The advanced language model processed a large dataset efficiently."
tokens = word_tokenize(text)
tagged = pos_tag(tokens)
parser = RegexpParser(chunk_grammar)
result = parser.parse(tagged)

for subtree in result.subtrees():
    if subtree.label() in ('NP', 'VP'):
        print(f"{subtree.label()}: {' '.join(w for w, t in subtree.leaves())}")

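# Typical output (the exact chunks depend on the tags the tagger assigns).
# The VP rule runs after the NP rule, so the VP chunk wraps the NP that follows the verb:
# NP: The advanced language model
# VP: processed a large dataset
# NP: a large dataset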
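
To show what "regex over POS tags" means mechanically, the sketch below applies the NP pattern with Python's standard `re` module instead of NLTK. The (word, tag) pairs are written by hand and stand in for `pos_tag` output, so the tags are an assumption; `NP_PATTERN` and `np_chunks` are illustrative names. This mirrors the idea behind RegexpParser rather than its actual implementation.

```python
import re

# Hand-written (word, tag) pairs standing in for pos_tag output (assumed tags).
tagged = [
    ("The", "DT"), ("advanced", "JJ"), ("language", "NN"), ("model", "NN"),
    ("processed", "VBD"), ("a", "DT"), ("large", "JJ"), ("dataset", "NN"),
    ("efficiently", "RB"), (".", "."),
]

# The NP rule {<DT>?<JJ>*<NN.*>+} rewritten as an ordinary regex over "<TAG>" units.
NP_PATTERN = re.compile(r"(?:<DT>)?(?:<JJ>)*(?:<NN[^>]*>)+")


def np_chunks(tagged_tokens):
    """Return noun-phrase strings found by matching NP_PATTERN on the tag sequence."""
    tag_string = "".join(f"<{tag}>" for _, tag in tagged_tokens)

    # Character offset at which each token's "<TAG>" unit starts,
    # used to map regex matches back to token positions.
    starts = []
    pos = 0
    for _, tag in tagged_tokens:
        starts.append(pos)
        pos += len(tag) + 2  # tag plus the two angle brackets

    chunks = []
    for m in NP_PATTERN.finditer(tag_string):
        first = starts.index(m.start())
        last = max(i for i, s in enumerate(starts) if s < m.end())
        chunks.append(" ".join(w for w, _ in tagged_tokens[first:last + 1]))
    return chunks


print(np_chunks(tagged))
# ['The advanced language model', 'a large dataset']
```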

Syntactic Analysis with spaCy

spaCy runs POS tagging and dependency parsing in a single pipeline call, and exposes the results as attributes on each token:

import spacy
nlp = spacy.load("en_core_web_sm")

text = "The startup launched its new AI product at the conference in San Francisco."
doc = nlp(text)

# Noun chunks: base noun phrases (flat, no nested NPs)
print("Noun chunks:")
for chunk in doc.noun_chunks:
    print(f"  {chunk.text:<30} root: {chunk.root.text}")

# Syntactic relations, printed head-first to match the head → dependent listing earlier
print("\nDependencies:")
for token in doc:
    print(f"  {token.head.text:<15} --[{token.dep_}]--> {token.text}")

Visualizing Syntax Trees

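displaCy, spaCy's built-in visualizer, draws the dependency arcs for a parsed Doc:

import spacy
from spacy import displacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Researchers published an influential paper on transformer models.")

# In a Jupyter notebook: render the diagram inline
displacy.render(doc, style="dep", jupyter=True)
# In a script: start a local web server and open the result in a browser
# displacy.serve(doc, style="dep")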

Why Syntax Still Matters

With transformer models now dominant (as of 2025), explicit syntax trees are less often fed to models as features. Syntactic analysis still plays a role in:

Grammar checking tools — identifying agreement errors, wrong tense, misplaced modifiers
Question generation — converting declarative sentences to questions requires understanding heads and subjects
Relation extraction — finding “who did what to whom” in structured pipelines
Low-resource NLP — for languages where deep learning models are not available, grammar-based systems are still practical
Interpretability — understanding what a model is attending to by comparing with syntactic structure
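
As an illustration of the relation-extraction item, the sketch below pulls "who did what to whom" triples out of a dependency parse. The parse is written by hand for the earlier example sentence, and `Token` and `extract_svo` are hypothetical names rather than spaCy's API, although the fields loosely mirror `token.dep_` and `token.head`. It returns head words only and handles simple active clauses; passives, conjunctions, and clausal objects would need extra rules.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    index: int
    text: str
    dep: str   # relation to the head
    head: int  # index of the head token; the root points to itself


# Hand-written parse of "The fast robot processed the data".
parsed = [
    Token(0, "The", "det", 2),
    Token(1, "fast", "amod", 2),
    Token(2, "robot", "nsubj", 3),
    Token(3, "processed", "ROOT", 3),
    Token(4, "the", "det", 5),
    Token(5, "data", "dobj", 3),
]


def extract_svo(tokens, object_labels=("dobj", "obj")):
    """Return (subject, verb, object) head-word triples for active clauses."""
    triples = []
    for subj in tokens:
        if subj.dep != "nsubj":
            continue
        verb = tokens[subj.head]
        # Objects are the tokens attached to the same verb with an object label.
        for obj in tokens:
            if obj.head == verb.index and obj.dep in object_labels:
                triples.append((subj.text, verb.text, obj.text))
    return triples


print(extract_svo(parsed))
# [('robot', 'processed', 'data')]
```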

Formal Grammars vs Neural Parsers

Approach	Accuracy	Speed	Flexibility
Rule-based CFG	Medium	Fast	Low
Statistical parsers (2010s)	Good	Medium	Medium
Neural parsers (spaCy, Stanza)	High	Fast	High
Fine-tuned BERT parsers	Very high	Moderate	Very high

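The strongest neural parsers report labeled attachment scores (LAS, the share of tokens given both the correct head and the correct relation label) of roughly 95% on the standard English Penn Treebank benchmark. Scores on Universal Dependencies English treebanks, from small general-purpose pipelines, and on out-of-domain text are typically lower.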
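
For reference, attachment scores are simple per-token accuracies. The sketch below computes the unlabeled attachment score (UAS, correct head) and the labeled attachment score (LAS, correct head and label) for one sentence. The gold and predicted analyses are made up for illustration, and `attachment_scores` is an illustrative name. Published evaluations follow benchmark-specific conventions, for example on whether punctuation tokens are counted, which this sketch ignores.

```python
def attachment_scores(gold, predicted):
    """Return (UAS, LAS) for two aligned lists of (head_index, label) pairs."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must cover the same tokens")
    if not gold:
        raise ValueError("cannot score an empty sentence")
    total = len(gold)
    uas_hits = sum(g[0] == p[0] for g, p in zip(gold, predicted))  # head only
    las_hits = sum(g == p for g, p in zip(gold, predicted))        # head and label
    return uas_hits / total, las_hits / total


# "The fast robot processed the data": heads are 1-based, 0 marks the root.
gold = [(3, "det"), (3, "amod"), (4, "nsubj"), (0, "root"), (6, "det"), (4, "dobj")]
# Made-up parser output: one wrong label ("fast") and one wrong head ("the").
predicted = [(3, "det"), (3, "compound"), (4, "nsubj"), (0, "root"), (4, "det"), (4, "dobj")]

uas, las = attachment_scores(gold, predicted)
print(f"UAS={uas:.3f} LAS={las:.3f}")
# UAS=0.833 LAS=0.667
```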