Parsing in NLP

Parsing converts a sentence into a structured representation of its grammatical relationships. It is a core step in linguistic analysis and underpins applications such as question answering, information extraction, and grammar checking.

Constituency Parsing

Constituency parsing divides a sentence into nested phrases, forming a hierarchical tree:

Sentence: "A young scientist published the groundbreaking paper."

S
├── NP
│   ├── DT   A
│   ├── JJ   young
│   └── NN   scientist
└── VP
    ├── VBD  published
    └── NP
        ├── DT   the
        ├── JJ   groundbreaking
        └── NN   paper

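Each phrasal node (S, NP, VP) represents a constituent: a group of words that functions as a single grammatical unit. The nodes directly above the words are part-of-speech tags.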
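
To make the nesting concrete, the tree for the example sentence can be written as nested tuples and walked recursively. This is a minimal sketch in plain Python; the tuple layout and the helper names leaves and constituents are illustrative assumptions, not part of any library.

```python
# Hand-written tree for the example sentence: (label, child, child, ...).
# A pre-terminal node is (POS tag, word).
TREE = (
    "S",
    ("NP", ("DT", "A"), ("JJ", "young"), ("NN", "scientist")),
    ("VP",
        ("VBD", "published"),
        ("NP", ("DT", "the"), ("JJ", "groundbreaking"), ("NN", "paper"))),
)


def leaves(node):
    """Return the words covered by a node, left to right."""
    if isinstance(node, str):
        return [node]
    words = []
    for child in node[1:]:
        words.extend(leaves(child))
    return words


def constituents(node):
    """Yield (label, phrase) for every phrasal (non-POS) node, top-down."""
    label, *children = node
    if all(isinstance(c, str) for c in children):
        return  # pre-terminal: a POS tag over a single word
    yield label, " ".join(leaves(node))
    for child in children:
        yield from constituents(child)


for label, phrase in constituents(TREE):
    print(f"{label:<3} {phrase}")
```

Each printed line is one constituent: the full sentence, the subject noun phrase, the verb phrase, and the object noun phrase nested inside it.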

Constituency Parsing with NLTK

NLTK ships with several parsers for context-free grammars (CFGs), including chart parsers such as the Earley chart parser used here:

import nltk

grammar = nltk.CFG.fromstring("""
  S  -> NP VP
  NP -> DT JJ NN | DT NN | NNP
  VP -> VBD NP | VBZ NP
  DT -> 'A' | 'a' | 'the' | 'The'
  JJ -> 'young' | 'groundbreaking' | 'new'
  NN -> 'scientist' | 'paper' | 'model'
  NNP -> 'Alice' | 'OpenAI'
  VBD -> 'published' | 'released'
  VBZ -> 'studies' | 'analyzes'
""")

parser = nltk.EarleyChartParser(grammar)
sentence = "A young scientist published the groundbreaking paper".split()

for tree in parser.parse(sentence):
    tree.pretty_print()

Probabilistic Parsing (PCFG)

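A probabilistic CFG (PCFG) attaches a probability to each rule, and the probabilities of all rules that share a left-hand side sum to 1. The probability of a parse is the product of the probabilities of the rules it uses, which lets the parser choose the most likely tree when a sentence is ambiguous: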

import nltk

pcfg_grammar = nltk.PCFG.fromstring("""
  S    -> NP VP          [1.0]
  NP   -> DT NN          [0.7]
  NP   -> NNP            [0.3]
  VP   -> VBD NP         [0.6]
  VP   -> VBZ JJ         [0.4]
  DT   -> 'the'          [0.8]
  DT   -> 'a'            [0.2]
  NN   -> 'model'        [0.5]
  NN   -> 'paper'        [0.5]
  NNP  -> 'Alice'        [1.0]
  VBD  -> 'released'     [1.0]
  VBZ  -> 'is'           [1.0]
  JJ   -> 'impressive'   [1.0]
""")

viterbi_parser = nltk.ViterbiParser(pcfg_grammar)
for tree in viterbi_parser.parse("the model released a paper".split()):
    print(tree)
    print(f"Probability: {tree.prob():.6f}")
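
As a sanity check, the probability printed by the Viterbi parser can be reproduced by hand: multiply the probabilities of every rule used in the tree. The sketch below does this for the single parse of "the model released a paper", using only the rule probabilities from the grammar above. The list name rules_used is an illustrative assumption.

```python
import math

# Rules used in the parse of "the model released a paper",
# with probabilities copied from the PCFG above.
rules_used = [
    ("S -> NP VP", 1.0),
    ("NP -> DT NN", 0.7),    # subject: "the model"
    ("DT -> 'the'", 0.8),
    ("NN -> 'model'", 0.5),
    ("VP -> VBD NP", 0.6),
    ("VBD -> 'released'", 1.0),
    ("NP -> DT NN", 0.7),    # object: "a paper"
    ("DT -> 'a'", 0.2),
    ("NN -> 'paper'", 0.5),
]

# The tree probability is the product of all rule probabilities.
tree_probability = math.prod(p for _, p in rules_used)
print(f"Probability: {tree_probability:.6f}")  # Probability: 0.011760
```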

Dependency Parsing with spaCy

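spaCy uses a neural transition-based dependency parser. Instead of nested phrases, it produces labeled head-dependent relations between words, and for most practical tasks it is faster and more robust than parsing with a hand-written CFG: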

import spacy
nlp = spacy.load("en_core_web_sm")

text = "Researchers at Stanford published a landmark study on LLM reasoning."
doc = nlp(text)

print(f"{'Token':<15} {'Dep':<12} {'Head':<15} {'Children'}")
print("-" * 65)
for token in doc:
    children = [c.text for c in token.children]
    print(f"{token.text:<15} {token.dep_:<12} {token.head.text:<15} {children}")

Visualizing the Parse

from spacy import displacy

doc = nlp("The new AI assistant understands complex multi-step questions.")
displacy.render(doc, style="dep", jupyter=True)
# In a script: displacy.serve(doc, style="dep")

Parsing for Practical Applications

Relation extraction — find subject-verb-object triples:

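def extract_svo(doc):
    """Collect (subject, verb, object) triples for every verb, not only the root."""
    triples = []
    for token in doc:
        if token.pos_ != "VERB":
            continue
        subjects = [w for w in token.lefts if w.dep_ in ("nsubj", "nsubjpass")]
        # Direct objects attach to the verb itself.
        objects = [w for w in token.rights if w.dep_ == "dobj"]
        # Prepositional objects hang off a "prep" child ("partnered with Microsoft").
        for prep in token.rights:
            if prep.dep_ == "prep":
                objects.extend(w for w in prep.rights if w.dep_ == "pobj")
        if subjects and objects:
            triples.append((subjects[0].text, token.text, objects[0].text))
    return triples

doc = nlp("Google acquired DeepMind and OpenAI partnered with Microsoft.")
print(extract_svo(doc))
# Intended result, assuming the parser attaches "OpenAI" as the subject of "partnered":
# [('Google', 'acquired', 'DeepMind'), ('OpenAI', 'partnered', 'Microsoft')]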

Question generation — invert subject-verb order using parse structure.
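
One way to sketch this is subject-auxiliary inversion: find the auxiliary and the subject phrase in the dependency tree, then move the auxiliary to the front. The example below is plain Python over hand-annotated tokens that stand in for parser output. The Token class, its fields (modeled on spaCy's token.dep_ and token.head), and both helper functions are illustrative assumptions, and the sketch handles only single-clause sentences that contain an auxiliary.

```python
from dataclasses import dataclass


@dataclass
class Token:
    text: str
    dep: str   # dependency label, modeled on spaCy's token.dep_
    head: int  # index of the head token; the root points to itself


def subtree_indices(tokens, root):
    """Return sorted indices of `root` and every token that depends on it."""
    found = {root}
    changed = True
    while changed:
        changed = False
        for i, tok in enumerate(tokens):
            if i not in found and tok.head in found:
                found.add(i)
                changed = True
    return sorted(found)


def yes_no_question(tokens):
    """Front the auxiliary: 'X can do Y' -> 'Can X do Y?'. None if not applicable."""
    aux = next((i for i, t in enumerate(tokens) if t.dep == "aux"), None)
    subj = next((i for i, t in enumerate(tokens) if t.dep == "nsubj"), None)
    if aux is None or subj is None:
        return None
    subject_span = subtree_indices(tokens, subj)
    rest = [i for i in range(len(tokens)) if i != aux and i not in subject_span]
    words = [tokens[aux].text.capitalize()]
    for i in subject_span + rest:
        text = tokens[i].text
        # Determiners are lowercased, since a sentence-initial one is no longer first.
        if tokens[i].dep == "det":
            text = text.lower()
        words.append(text)
    return " ".join(words) + "?"


# Hand-annotated parse of "The assistant can answer complex questions".
sentence = [
    Token("The", "det", 1),
    Token("assistant", "nsubj", 3),
    Token("can", "aux", 3),
    Token("answer", "ROOT", 3),
    Token("complex", "amod", 5),
    Token("questions", "dobj", 3),
]
print(yes_no_question(sentence))  # Can the assistant answer complex questions?
```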

Grammar checking — detect subject-verb disagreement by traversing the dependency tree.
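
A minimal version of such a check follows each subject to its head verb and compares their grammatical number. The sketch below uses hand-annotated tokens in plain Python rather than real parser output. The Token class and the agreement_errors function are illustrative assumptions; with spaCy, the corresponding values would come from token.dep_, token.head, and token.morph.get("Number").

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Token:
    text: str
    dep: str               # dependency label, modeled on spaCy's token.dep_
    head: int              # index of the head token
    number: Optional[str]  # "Sing", "Plur", or None if unmarked


def agreement_errors(tokens):
    """Return (subject, verb) pairs whose grammatical number conflicts."""
    errors = []
    for tok in tokens:
        if tok.dep != "nsubj":
            continue
        verb = tokens[tok.head]
        # Only compare when both sides carry a number feature.
        if tok.number and verb.number and tok.number != verb.number:
            errors.append((tok.text, verb.text))
    return errors


# "The models is impressive": plural subject, singular verb.
bad = [
    Token("The", "det", 1, None),
    Token("models", "nsubj", 2, "Plur"),
    Token("is", "ROOT", 2, "Sing"),
    Token("impressive", "acomp", 2, None),
]
# "The model is impressive": subject and verb agree.
good = [
    Token("The", "det", 1, None),
    Token("model", "nsubj", 2, "Sing"),
    Token("is", "ROOT", 2, "Sing"),
    Token("impressive", "acomp", 2, None),
]

print(agreement_errors(bad))   # [('models', 'is')]
print(agreement_errors(good))  # []
```

Skipping tokens without a number feature avoids false alarms on forms that do not mark number, such as English past-tense verbs.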

Performance on Modern Hardware

On a CPU, spaCy's en_core_web_sm processes roughly 12,000 tokens per second. The larger transformer-based en_core_web_trf trades speed for accuracy at roughly 400 tokens per second. Exact figures depend on hardware, spaCy version, and document length. For production pipelines that process millions of documents, batch the input with nlp.pipe() instead of calling nlp() on each text:

texts = ["sentence one", "sentence two"]
for doc in nlp.pipe(texts, batch_size=64):
    pass  # process each doc
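
nlp.pipe() accepts any iterable, so texts can be streamed from a generator instead of being loaded into memory at once. Conceptually, batch_size sets how many texts are grouped per processing step. The sketch below illustrates only that grouping in plain Python; batches is a hypothetical helper, not spaCy's internal implementation.

```python
from itertools import islice


def batches(items, batch_size):
    """Yield successive lists of up to `batch_size` items from any iterable."""
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


texts = (f"sentence {i}" for i in range(5))  # a lazy stream of texts
for batch in batches(texts, batch_size=2):
    print(batch)
```

The last batch may be smaller than batch_size when the number of texts is not an exact multiple of it.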