Stanford CoreNLP
Stanford CoreNLP is a comprehensive Java-based NLP toolkit developed at Stanford University. It provides a wide range of linguistic annotations, including tokenization, POS tagging, NER, coreference resolution, sentiment analysis, and dependency parsing, all in a single integrated pipeline.
Two Ways to Use CoreNLP from Python
Option 1: Stanza, the Stanford NLP Group’s official Python library (recommended). It runs its own neural models directly in Python, so no Java installation is needed.
Option 2: CoreNLP Server + py-corenlp. Run the Java CoreNLP server and call it from Python over its REST API.
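To make Option 2 more concrete, here is a minimal sketch of what a REST call to the server looks like, using only the Python standard library. It assumes the server listens on `http://localhost:9000` and accepts the annotator settings as a JSON-encoded `properties` query parameter, with the raw text as the POST body. The helper name `build_corenlp_request` is hypothetical, and the request is built but not sent, so the snippet runs without a server.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request


def build_corenlp_request(text, annotators, base_url="http://localhost:9000"):
    """Build (but do not send) a POST request for a CoreNLP server.

    Hypothetical helper: the annotator list and output format travel in a
    JSON-encoded `properties` query parameter; the text is the request body.
    """
    properties = {"annotators": ",".join(annotators), "outputFormat": "json"}
    query = urlencode({"properties": json.dumps(properties)})
    return Request(
        url=f"{base_url}/?{query}",
        data=text.encode("utf-8"),
        method="POST",
    )


req = build_corenlp_request("Stanford is in California.", ["tokenize", "ssplit", "pos"])
print(req.get_method())
print(req.full_url)

# Sending the request requires a running server:
# from urllib.request import urlopen
# with urlopen(req) as resp:
#     result = json.load(resp)
```

A client library such as py-corenlp wraps this kind of request so you do not have to build it by hand.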
Option 1: Stanza (Recommended)
Stanza is the Stanford NLP Group’s Python library. It is built on PyTorch, ships its own neural models rather than CoreNLP’s Java models, and supports 70+ languages:
Install Stanza:

    pip install stanza

Then download the English models and run a pipeline:

    import stanza

    # Download English models
    stanza.download('en')

    nlp = stanza.Pipeline('en', processors='tokenize,mwt,pos,lemma,depparse,ner')

    text = "Apple CEO Tim Cook announced the new Mac Studio at the company's Cupertino headquarters in March 2025."
    doc = nlp(text)

    # Tokens and POS
    print("=== POS Tags ===")
    for sent in doc.sentences:
        for word in sent.words:
            print(f"{word.text:<20} pos: {word.pos:<8} lemma: {word.lemma}")

    # Named Entities
    print("\n=== Named Entities ===")
    for ent in doc.ents:
        print(f"{ent.text:<25} type: {ent.type}")

    # Dependency Parse (word.head is a 1-based index; 0 marks the root)
    print("\n=== Dependencies ===")
    for sent in doc.sentences:
        for word in sent.words:
            head = sent.words[word.head - 1].text if word.head > 0 else "ROOT"
            print(f"{word.text:<20} deprel: {word.deprel:<10} head: {head}")

Multilingual Processing with Stanza
    import stanza

    # Process multiple languages with the same interface
    languages = {
        'en': "The transformer model achieved state-of-the-art results.",
        'fr': "Le modèle de transformateur a obtenu des résultats de pointe.",
        'de': "Das Transformer-Modell erzielte modernste Ergebnisse.",
        'zh': "变换器模型取得了最先进的结果。",
    }

    for lang_code, text in languages.items():
        stanza.download(lang_code, verbose=False)
        nlp = stanza.Pipeline(lang_code, verbose=False)
        doc = nlp(text)
        tokens = [word.text for sent in doc.sentences for word in sent.words]
        print(f"{lang_code}: {tokens}")

Coreference Resolution
Coreference resolution identifies when multiple mentions in a text refer to the same entity. It is one of CoreNLP’s most distinctive features.
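To illustrate what a coreference chain is useful for, the sketch below replaces each pronoun in a chain with the chain's representative mention. The mention records are hand-written sample data, not real CoreNLP output. Their keys (`text`, `sentNum`, `startIndex`, `endIndex`, `isRepresentativeMention`) are modeled on CoreNLP's JSON format, and the indexing convention (1-based, exclusive end) is an assumption. The function name `resolve_pronouns` is hypothetical.

```python
def resolve_pronouns(sentences, chain):
    """Replace every non-representative mention in `chain` with the representative text.

    `sentences` is a list of token lists. Mention indices are assumed to be
    1-based with an exclusive endIndex.
    """
    representative = next(m for m in chain if m["isRepresentativeMention"])
    resolved = [list(tokens) for tokens in sentences]
    # Substitute right-to-left so earlier token indices stay valid.
    ordered = sorted(chain, key=lambda m: (m["sentNum"], m["startIndex"]), reverse=True)
    for mention in ordered:
        if mention is representative:
            continue
        tokens = resolved[mention["sentNum"] - 1]
        tokens[mention["startIndex"] - 1 : mention["endIndex"] - 1] = [representative["text"]]
    return [" ".join(tokens) for tokens in resolved]


# Hand-written sample data (hypothetical, not produced by CoreNLP).
sentences = [
    ["Mary", "said", "she", "would", "finish", "the", "project", "."],
    ["She", "thanked", "the", "team", "."],
]
chain = [
    {"text": "Mary", "sentNum": 1, "startIndex": 1, "endIndex": 2, "isRepresentativeMention": True},
    {"text": "she", "sentNum": 1, "startIndex": 3, "endIndex": 4, "isRepresentativeMention": False},
    {"text": "She", "sentNum": 2, "startIndex": 1, "endIndex": 2, "isRepresentativeMention": False},
]

resolved_text = resolve_pronouns(sentences, chain)
for line in resolved_text:
    print(line)
```

Possessive mentions such as "her" would need extra handling (for example, emitting "Mary's"), which this sketch leaves out.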
    # Run CoreNLP server (requires Java 8+)
    # Download from: https://stanfordnlp.github.io/CoreNLP/
    java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer \
        -port 9000 -timeout 15000

With the server running, install the client and request coreference annotations:

    # pip install pycorenlp
    from pycorenlp import StanfordCoreNLP

    nlp = StanfordCoreNLP('http://localhost:9000')

    text = """Mary said she would finish the NLP project by Friday.
    She mentioned that her team had already completed the data preprocessing step."""

    # dcoref depends on a constituency parse, so 'parse' is listed explicitly
    result = nlp.annotate(text, properties={
        'annotators': 'tokenize,ssplit,pos,lemma,ner,parse,dcoref',
        'outputFormat': 'json',
        'timeout': 15000
    })

    print("Coreference chains:")
    for chain_id, chain in result['corefs'].items():
        # (mention text, sentence number, index of the mention's first token)
        mentions = [(m['text'], m['sentNum'], m['startIndex']) for m in chain]
        print(f"Chain {chain_id}: {mentions}")

    # Illustrative output for the chain that refers to Mary; chain IDs and any
    # additional chains depend on the CoreNLP version and models:
    # Chain <id>: [('Mary', 1, 1), ('she', 1, 3), ('She', 2, 1), ('her', 2, 4)]

Sentiment Analysis
CoreNLP’s sentiment analyzer scores each sentence on a 5-point scale, from 0 (Very Negative) to 4 (Very Positive).
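Because the scores are per sentence, a multi-sentence review needs some form of aggregation. The sketch below shows one simple option: the mean score plus a count per label. It assumes each sentence record carries a `sentimentValue` string between "0" and "4". The sample records are hand-written, and `summarize_sentiment` is a hypothetical helper rather than part of any library.

```python
from collections import Counter

SENTIMENT_LABELS = ["Very Negative", "Negative", "Neutral", "Positive", "Very Positive"]


def summarize_sentiment(sentences):
    """Return (mean score, label counts) for a list of sentence records."""
    scores = [int(s["sentimentValue"]) for s in sentences]
    mean_score = sum(scores) / len(scores)
    label_counts = Counter(SENTIMENT_LABELS[score] for score in scores)
    return mean_score, label_counts


# Hand-written sample records (hypothetical, not real server output).
sample = [
    {"sentimentValue": "3"},
    {"sentimentValue": "1"},
    {"sentimentValue": "3"},
]

mean_score, label_counts = summarize_sentiment(sample)
print(f"mean score: {mean_score:.2f}")
print(label_counts.most_common())
```

Averaging treats the five classes as evenly spaced, which is a simplification. A majority vote over labels is a reasonable alternative when that assumption is uncomfortable.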
    from pycorenlp import StanfordCoreNLP

    nlp = StanfordCoreNLP('http://localhost:9000')

    reviews = [
        "The new model is absolutely brilliant and works flawlessly.",
        "Terrible performance, crashes constantly, completely unusable.",
        "The library works fine, documentation could be better.",
    ]

    sentiment_map = {0: "Very Negative", 1: "Negative", 2: "Neutral", 3: "Positive", 4: "Very Positive"}

    for review in reviews:
        result = nlp.annotate(review, properties={
            'annotators': 'sentiment',
            'outputFormat': 'json'
        })
        for sentence in result["sentences"]:
            # sentimentValue arrives as a string, so convert it before the lookup
            score = int(sentence["sentimentValue"])
            print(f"[{sentiment_map[score]}] {review}")

CoreNLP Annotators Reference
| Annotator | What it does | Requires |
|---|---|---|
| tokenize | Split text into tokens | none |
| ssplit | Sentence splitting | tokenize |
| pos | POS tagging | tokenize, ssplit |
| lemma | Lemmatization | pos |
| ner | Named entity recognition | pos, lemma |
| parse | Constituency parsing | tokenize, ssplit |
| depparse | Dependency parsing | pos |
| coref/dcoref | Coreference resolution | ner, parse (coref can use depparse instead) |
| sentiment | Sentiment per sentence | parse |
| openie | Open information extraction | depparse |
| kbp | Relation extraction | ner |
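The "Requires" column describes a dependency graph, so a valid annotator list can be derived by visiting prerequisites first. The sketch below does this for a small, hand-copied subset of the table. The `REQUIRES` dictionary and the `expand_annotators` helper are illustrative assumptions, not part of CoreNLP, and the code assumes the graph has no cycles.

```python
# Hand-copied subset of the table above (illustrative, not exhaustive).
REQUIRES = {
    "tokenize": [],
    "ssplit": ["tokenize"],
    "pos": ["tokenize", "ssplit"],
    "lemma": ["pos"],
    "ner": ["pos", "lemma"],
    "depparse": ["pos"],
}


def expand_annotators(requested):
    """Return the requested annotators plus all prerequisites, in a runnable order."""
    ordered = []

    def visit(name):
        if name in ordered:
            return
        # Depth-first: every prerequisite is appended before the annotator itself.
        for prerequisite in REQUIRES[name]:
            visit(prerequisite)
        ordered.append(name)

    for name in requested:
        visit(name)
    return ordered


pipeline = expand_annotators(["ner", "depparse"])
print(",".join(pipeline))
# → tokenize,ssplit,pos,lemma,ner,depparse
```

The resulting comma-separated string has the same shape as the `annotators` property passed to the server.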
CoreNLP vs spaCy vs Stanza
| Feature | CoreNLP (Java) | spaCy | Stanza |
|---|---|---|---|
| Language | Java (Python via server) | Python | Python |
| Speed | Slow | Fast | Medium |
| Coreference | Excellent | Limited | Limited |
| Languages | 8 | 20+ | 70+ |
| Ease of use | Complex | Easy | Easy |
| Accuracy | High | High | High |
| Streaming corpora | No | Yes | Yes |
CoreNLP is the right choice when you need coreference resolution or relation extraction that other libraries lack. For standard NLP tasks, Stanza offers the Stanford NLP Group’s neural models through a much simpler Python interface.