💬 Natural Language Processing 40 guides · updated 2026

From tokenisation and embeddings to transformer-based language understanding — the NLP fundamentals that underpin every modern LLM.

Elasticsearch and NLP

Elasticsearch powers full-text search with BM25 ranking out of the box. Combined with NLP — sentence embeddings, NER, and ML inference pipelines — it becomes a platform for semantic search, vector retrieval, and knowledge extraction at scale.


Architecture: How Elasticsearch + NLP Fits Together

Documents
    ↓
NLP Preprocessing (spaCy / Hugging Face)
    ↓
Elasticsearch Index
├── text fields (BM25 keyword search)
├── dense_vector fields (semantic similarity)
└── keyword fields (NER entities, categories)

Query → Hybrid Search (keyword + vector) → Ranked Results

Setup: Python Client

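pip install elasticsearch sentence-transformers spacy
python -m spacy download en_core_web_sm
# Security is disabled here for local experiments only; 8.x images enable TLS and authentication by default
docker run -p 9200:9200 -e "discovery.type=single-node" -e "xpack.security.enabled=false" docker.elastic.co/elasticsearch/elasticsearch:8.12.0

from elasticsearch import Elasticsearch

# Connect to the local single-node cluster started above
es = Elasticsearch("http://localhost:9200")
print(es.info()["version"]["number"])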

Creating an Index with Dense Vector Support

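from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

mapping = {
    "mappings": {
        "properties": {
            "title": {"type": "text"},
            "content": {"type": "text"},
            # dims must match the embedding model's output size (384 for all-MiniLM-L6-v2)
            "embedding": {
                "type": "dense_vector",
                "dims": 384,
                "index": True,
                "similarity": "cosine"
            },
            # nested keeps each entity's text and label paired at query time
            "entities": {
                "type": "nested",
                "properties": {
                    "text": {"type": "keyword"},
                    "label": {"type": "keyword"}
                }
            },
            "category": {"type": "keyword"}
        }
    }
}

# ignore_status=400 tolerates the "index already exists" error on re-runs
es.options(ignore_status=400).indices.create(
    index="nlp_articles",
    mappings=mapping["mappings"]
)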
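
The dims value in the mapping has to agree with the length of every vector sent to the index, and a cosine-similarity field does not accept zero-magnitude vectors. As a minimal sketch, a guard like the following could run before indexing. The helper name validate_embedding and the constant EXPECTED_DIMS are assumptions made for illustration; they are not part of the Elasticsearch client.

```python
import math

EXPECTED_DIMS = 384  # assumed to equal "dims" in the index mapping


def validate_embedding(vector, expected_dims=EXPECTED_DIMS):
    """Return the vector as a list of floats, or raise ValueError."""
    values = [float(v) for v in vector]
    if len(values) != expected_dims:
        raise ValueError(f"expected {expected_dims} dimensions, got {len(values)}")
    if not all(math.isfinite(v) for v in values):
        raise ValueError("embedding contains NaN or infinite values")
    # Cosine similarity is undefined for a zero-magnitude vector.
    if not any(values):
        raise ValueError("embedding is all zeros")
    return values


print(len(validate_embedding([0.1] * 384)))
```

Failing early in Python tends to give a clearer error than a rejected index request, especially when the embedding model is swapped for one with a different output size.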

Indexing Documents with Embeddings and NER

from elasticsearch import Elasticsearch
from sentence_transformers import SentenceTransformer
import spacy

es = Elasticsearch("http://localhost:9200")
model = SentenceTransformer('all-MiniLM-L6-v2')
nlp = spacy.load("en_core_web_sm")

articles = [
    {
        "title": "OpenAI Releases GPT-5 with Enhanced Reasoning",
        "content": "OpenAI launched GPT-5 in San Francisco with CEO Sam Altman announcing improved coding and reasoning capabilities.",
        "category": "AI"
    },
    {
        "title": "Mistral AI Raises $600M Series B",
        "content": "The Paris-based Mistral AI secured $600 million in Series B funding, led by General Catalyst.",
        "category": "Business"
    },
    {
        "title": "Stanford NLP Group Releases New Stanza Version",
        "content": "Stanford University's NLP group updated Stanza to support 72 languages with improved accuracy.",
        "category": "Research"
    }
]

for i, article in enumerate(articles):
    full_text = f"{article['title']} {article['content']}"
    # Generate embedding (a 384-dimensional list of floats)
    embedding = model.encode(full_text).tolist()
    # Extract entities as {"text", "label"} pairs for the nested field
    doc = nlp(full_text)
    entities = [{"text": ent.text, "label": ent.label_} for ent in doc.ents]
    # Index the article together with its embedding and entities
    es.index(
        index="nlp_articles",
        id=str(i),
        document={**article, "embedding": embedding, "entities": entities}
    )

# Make the new documents visible to search immediately
es.indices.refresh(index="nlp_articles")
print(f"Indexed {len(articles)} articles")
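
Indexing one document per request is fine for three articles, but larger collections are usually sent through the bulk helper. The sketch below only builds the action dictionaries, so it runs without a model or a cluster. The function build_bulk_actions and the stand-in callables fake_embed and fake_entities are hypothetical names introduced for illustration.

```python
def build_bulk_actions(articles, index_name, embed, extract_entities):
    """Yield one bulk action dict per article."""
    for i, article in enumerate(articles):
        full_text = f"{article['title']} {article['content']}"
        yield {
            "_index": index_name,
            "_id": str(i),
            "_source": {
                **article,
                "embedding": embed(full_text),
                "entities": extract_entities(full_text),
            },
        }


# Stand-in callables (assumptions) so the sketch runs offline.
def fake_embed(text):
    return [0.0] * 384


def fake_entities(text):
    return []


sample = [{"title": "A", "content": "B", "category": "AI"}]
actions = list(build_bulk_actions(sample, "nlp_articles", fake_embed, fake_entities))
print(actions[0]["_id"], len(actions[0]["_source"]["embedding"]))

# With a live cluster, the generator would be handed to the bulk helper:
#   from elasticsearch import helpers
#   helpers.bulk(es, build_bulk_actions(articles, "nlp_articles", embed, extract))
```

In a real run, embed would wrap model.encode(...).tolist() and extract_entities would wrap the spaCy pipeline from the loop above.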

Keyword Search (BM25)

# Reuses the es client created in the previous section
query = "GPT artificial intelligence reasoning"

result = es.search(
    index="nlp_articles",
    query={
        "multi_match": {
            "query": query,
            # title^2 doubles the weight of matches in the title
            "fields": ["title^2", "content"],
            "type": "best_fields"
        }
    }
)

print("BM25 Results:")
for hit in result["hits"]["hits"]:
    print(f" [{hit['_score']:.3f}] {hit['_source']['title']}")
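
To make the _score values above less opaque, here is a minimal sketch of the textbook BM25 formula in plain Python, using the default parameters k1 = 1.2 and b = 0.75. It is an illustration only: Lucene's implementation differs in details such as length encoding and per-shard statistics, so the numbers will not match Elasticsearch exactly. The function name bm25_scores and the toy corpus are assumptions.

```python
import math
from collections import Counter


def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Score each tokenised document against the query with textbook BM25."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for d in docs for term in set(d))
    scores = []
    for doc in docs:
        term_freq = Counter(doc)
        score = 0.0
        for term in query_terms:
            freq = term_freq[term]
            if freq == 0:
                continue
            n_t = doc_freq[term]
            # Rare terms get a higher inverse document frequency.
            idf = math.log(1 + (n_docs - n_t + 0.5) / (n_t + 0.5))
            # Longer-than-average documents are penalised through b.
            norm = freq + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * freq * (k1 + 1) / norm
        scores.append(score)
    return scores


docs = [
    "openai releases gpt with enhanced reasoning".split(),
    "mistral raises series b funding".split(),
    "stanford releases new stanza version".split(),
]
print(bm25_scores(["gpt", "reasoning"], docs))
```

Only the first toy document contains either query term, so it is the only one with a non-zero score. This also shows why a pure keyword query misses documents that use different wording.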

Vector Search (Semantic)

from sentence_transformers import SentenceTransformer

# Must be the same model that produced the indexed embeddings
model = SentenceTransformer('all-MiniLM-L6-v2')

query = "company raising investment money"
query_embedding = model.encode(query).tolist()

result = es.search(
    index="nlp_articles",
    knn={
        "field": "embedding",
        "query_vector": query_embedding,
        "k": 3,                # number of nearest neighbours to return
        "num_candidates": 10   # candidates examined per shard
    }
)

print("Semantic Search Results:")
for hit in result["hits"]["hits"]:
    print(f" [{hit['_score']:.4f}] {hit['_source']['title']}")
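
For a dense_vector field with cosine similarity, Elasticsearch reports a kNN _score of (1 + cosine) / 2, so that scores are never negative. The following sketch reproduces that mapping in plain Python on toy two-dimensional vectors. The function names are assumptions made for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def knn_cosine_score(a, b):
    """Map cosine in [-1, 1] onto the non-negative range [0, 1]."""
    return (1 + cosine_similarity(a, b)) / 2


print(knn_cosine_score([1.0, 0.0], [1.0, 0.0]))   # identical direction
print(knn_cosine_score([1.0, 0.0], [0.0, 1.0]))   # orthogonal
print(knn_cosine_score([1.0, 0.0], [-1.0, 0.0]))  # opposite direction
```

A score near 0.5 therefore means the vectors are roughly unrelated, not half-relevant, which is worth keeping in mind when setting a score threshold.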

Hybrid Search (BM25 + Vector)

Combining keyword and semantic search often outperforms either alone, because BM25 rewards exact term matches while vector similarity captures paraphrases:

# Reuses es and model from the previous sections
query_text = "neural language model new release"
query_embedding = model.encode(query_text).tolist()

result = es.search(
    index="nlp_articles",
    query={
        "multi_match": {
            "query": query_text,
            "fields": ["title^2", "content"]
        }
    },
    knn={
        "field": "embedding",
        "query_vector": query_embedding,
        "k": 3,
        "num_candidates": 10,
        # Weight applied to the kNN score before it is added to the BM25 score
        "boost": 0.5
    }
)

print("Hybrid Search Results:")
for hit in result["hits"]["hits"]:
    print(f" [{hit['_score']:.4f}] {hit['_source']['title']}")
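
When a request carries both a query and a knn clause, Elasticsearch adds the two scores for each hit, each multiplied by its boost. A document found by only one side receives only that side's contribution. The sketch below imitates that weighted sum on made-up scores. The function combine_scores and all score values are assumptions for illustration, not output from a real cluster.

```python
def combine_scores(bm25_hits, knn_hits, query_boost=1.0, knn_boost=0.5):
    """Weighted sum of per-document scores; a missing score contributes 0."""
    combined = {}
    for doc_id, score in bm25_hits.items():
        combined[doc_id] = combined.get(doc_id, 0.0) + query_boost * score
    for doc_id, score in knn_hits.items():
        combined[doc_id] = combined.get(doc_id, 0.0) + knn_boost * score
    # Highest combined score first, as in a search response.
    return sorted(combined.items(), key=lambda item: item[1], reverse=True)


bm25_hits = {"0": 2.0, "2": 0.5}   # made-up BM25 scores
knn_hits = {"0": 0.75, "1": 0.5}   # made-up kNN scores
print(combine_scores(bm25_hits, knn_hits))
```

Because BM25 scores are unbounded while cosine-based kNN scores stay between 0 and 1, the boost values usually need tuning on real queries before the two signals balance sensibly.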

Searching by Named Entity

# Find articles mentioning a specific organization
result = es.search(
    index="nlp_articles",
    body={
        "query": {
            "nested": {
                "path": "entities",
                "query": {
                    "bool": {
                        "must": [
                            {"match": {"entities.label": "ORG"}},
                            {"match": {"entities.text": "OpenAI"}}
                        ]
                    }
                }
            }
        }
    }
)

print("Articles mentioning OpenAI (ORG):")
for hit in result["hits"]["hits"]:
    print(f" {hit['_source']['title']}")
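
The entities field is mapped as nested so that a label and a text only match when they belong to the same entity. With a plain object array, Elasticsearch flattens the values per field, and a query for text "San Francisco" with label "ORG" could match a document in which those values come from different entities. The sketch below emulates both behaviours in plain Python. The function names are hypothetical.

```python
def flattened_match(entities, text, label):
    """Emulates a plain object array: each field is matched independently."""
    texts = {e["text"] for e in entities}
    labels = {e["label"] for e in entities}
    return text in texts and label in labels


def nested_match(entities, text, label):
    """Emulates a nested field: both conditions must hold on one entity."""
    return any(e["text"] == text and e["label"] == label for e in entities)


entities = [
    {"text": "OpenAI", "label": "ORG"},
    {"text": "San Francisco", "label": "GPE"},
]

print(flattened_match(entities, "San Francisco", "ORG"))  # True (false positive)
print(nested_match(entities, "San Francisco", "ORG"))     # False
print(nested_match(entities, "OpenAI", "ORG"))            # True
```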

ELSER — Learned Sparse Retrieval

Elasticsearch’s native learned sparse retrieval (ELSER) model combines the interpretability of keyword search with the semantic power of transformers — no external embedding model required:

# Enable in Elasticsearch (requires license)
PUT /_inference/sparse_embedding/my-elser-endpoint
{
  "service": "elser",
  "service_settings": { "num_allocations": 1, "num_threads": 1 }
}

ELSER produces sparse semantic vectors that are indexed efficiently using Elasticsearch’s inverted index — no dense vector storage overhead.
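
Conceptually, a sparse semantic vector is a mapping from tokens to weights, and relevance is roughly the dot product over the tokens that the query and the document share. The sketch below illustrates that idea only. The tokens and weights are invented, the function sparse_dot is hypothetical, and real ELSER expansions contain many more tokens with model-assigned weights.

```python
def sparse_dot(query_weights, doc_weights):
    """Sum the weight products for tokens present in both mappings."""
    return sum(
        weight * doc_weights[token]
        for token, weight in query_weights.items()
        if token in doc_weights
    )


# Invented token weights, for illustration only.
doc = {"funding": 1.5, "investment": 1.0, "startup": 0.5}
query = {"investment": 2.0, "money": 1.0, "funding": 0.5}

print(sparse_dot(query, doc))
```

Because only shared tokens contribute, each document's score can be traced back to specific terms, which is where the interpretability mentioned above comes from.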


When to Use Elasticsearch + NLP

For smaller-scale RAG applications, simpler vector databases like Chroma, Weaviate, or Pinecone are easier to set up.