Amazon Neptune: Managed Graph Database for Connected Data and Relationship Queries
Relational databases answer questions like “what are the orders for customer ID 1234?” Graph databases answer questions like “which customers who bought product A also bought product B, and what other products did their friends purchase?” The difference is that graph databases treat relationships as first-class citizens: each relationship is stored and indexed alongside the data it connects, so following one is a direct lookup rather than a join computed at query time.
In a relational database, discovering a relationship requires a JOIN. Joining through multiple relationships requires multiple JOINs, and query performance tends to degrade with each hop. A social network query asking “who are my second-degree connections who live in Chicago and work in finance?” might require five or six JOINs across large tables. The same query in a graph database is a traversal: follow edges from node to node. Its cost depends mainly on the number of edges visited rather than on the total size of the tables, so multi-hop queries tend to stay fast as the dataset grows.
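To make the traversal idea concrete, here is a minimal in-memory sketch in plain Python (not Neptune code). The sample people, connections, and the `second_degree` helper are all hypothetical; the point is only that each hop is a lookup in an adjacency list rather than a join over whole tables.

```python
from collections import defaultdict

# Hypothetical sample data: connections plus two attributes per person.
connections = [("me", "ann"), ("me", "raj"), ("ann", "lee"), ("raj", "kim"), ("raj", "ann")]
people = {
    "ann": {"city": "Chicago", "industry": "finance"},
    "raj": {"city": "Boston", "industry": "tech"},
    "lee": {"city": "Chicago", "industry": "finance"},
    "kim": {"city": "Chicago", "industry": "retail"},
}

# Build the adjacency list once; each hop is then a dictionary lookup.
adjacency = defaultdict(set)
for a, b in connections:
    adjacency[a].add(b)
    adjacency[b].add(a)  # treat connections as undirected

def second_degree(person):
    """People two hops away who are not already direct connections."""
    first = adjacency[person]
    second = set()
    for friend in first:
        second |= adjacency[friend]
    return second - first - {person}

# Second-degree connections who live in Chicago and work in finance.
matches = sorted(
    p for p in second_degree("me")
    if people[p]["city"] == "Chicago" and people[p]["industry"] == "finance"
)
```

Only the neighbors of "me" and their neighbors are touched; the rest of the dataset is never scanned.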
Amazon Neptune is AWS’s fully managed graph database service. It supports two graph models and three query languages, handling billions of nodes and edges with millisecond query latency.
Two Graph Models
Neptune supports two fundamentally different ways of modeling graph data:
Property Graph: entities are vertices (nodes) and relationships are edges. Both vertices and edges can have properties — key-value pairs that store attributes. This model is intuitive for most developers because it maps naturally to how we think about objects and their relationships.
Property Graph Model
====================

Vertex (node):
┌─────────────────┐
│ User            │
│ id: u1001       │
│ name: "Alice"   │
│ city: "NYC"     │
└─────────────────┘
        │
        │ Edge: FOLLOWS (since: 2024-01-15)
        ▼
┌─────────────────┐
│ User            │
│ id: u2002       │
│ name: "Bob"     │
│ city: "Boston"  │
└─────────────────┘
        │
        │ Edge: LIKED (at: 2025-06-10)
        ▼
┌─────────────────┐
│ Post            │
│ id: p3003       │
│ title: "..."    │
└─────────────────┘

Query: "Show posts liked by people Alice follows"
→ Traverse FOLLOWS edge → traverse LIKED edges → return posts

RDF (Resource Description Framework): models data as triples of subject, predicate, and object. Every fact is expressed as a triple: (Alice, worksAt, AcmeCorp). RDF is the foundation of semantic web standards and is widely used in knowledge graphs, life sciences, and linked data applications.
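The two models can be sketched side by side in plain Python. This is an illustration of the data shapes only, not how Neptune stores them: the `out` and `match` helpers are hypothetical, and the second and third triples are invented sample facts added to the (Alice, worksAt, AcmeCorp) example.

```python
# Property graph: vertices and edges both carry key-value properties.
vertices = {
    "u1001": {"label": "User", "name": "Alice", "city": "NYC"},
    "u2002": {"label": "User", "name": "Bob", "city": "Boston"},
    "p3003": {"label": "Post", "title": "..."},
}
edges = [
    {"from": "u1001", "to": "u2002", "label": "FOLLOWS", "since": "2024-01-15"},
    {"from": "u2002", "to": "p3003", "label": "LIKED", "at": "2025-06-10"},
]

def out(vertex_ids, label):
    """Follow outgoing edges with the given label from a set of vertex ids."""
    return {e["to"] for e in edges if e["from"] in vertex_ids and e["label"] == label}

# "Show posts liked by people Alice follows": FOLLOWS hop, then LIKED hop.
liked_posts = out(out({"u1001"}, "FOLLOWS"), "LIKED")

# RDF: every fact is a (subject, predicate, object) triple.
triples = {
    ("Alice", "worksAt", "AcmeCorp"),
    ("Alice", "follows", "Bob"),      # hypothetical extra fact
    ("Bob", "worksAt", "AcmeCorp"),   # hypothetical extra fact
}

def match(s=None, p=None, o=None):
    """Return triples matching a pattern; None acts as a wildcard."""
    return sorted(
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    )

# "Who works at AcmeCorp?" is a single triple pattern with an open subject.
acme_staff = [t[0] for t in match(p="worksAt", o="AcmeCorp")]
```

In the property graph, attributes live on the vertex or edge itself; in RDF, each attribute would be one more triple.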
Query Languages
Neptune supports three query languages:
Gremlin: the traversal language for Property Graph. Queries are written as chains of steps. Part of Apache TinkerPop, a widely adopted graph computing framework.
OpenCypher: a declarative graph query language for Property Graph, based on the open specification of Neo4j’s Cypher. Its pattern-matching syntax is usually more readable than Gremlin for developers used to SQL.
SPARQL: the query language for RDF graphs. SQL-like syntax with explicit triple patterns. Used predominantly in academic, biomedical, and enterprise knowledge graph applications.
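For comparison, the earlier question ("posts liked by people Alice follows") might be written in each language roughly as follows. The queries are held as Python strings so they sit next to each other. The labels, property names, and the `ex:` prefix are assumptions about a schema the article does not define, and the endpoint paths should be checked against the current Neptune documentation.

```python
# One question, three query languages. Schema names are illustrative assumptions.
QUERIES = {
    "gremlin": (
        "g.V().has('User', 'name', 'Alice')"
        ".out('FOLLOWS').out('LIKED').hasLabel('Post').values('title')"
    ),
    "opencypher": (
        "MATCH (:User {name: 'Alice'})-[:FOLLOWS]->(:User)-[:LIKED]->(p:Post) "
        "RETURN p.title"
    ),
    "sparql": (
        "PREFIX ex: <http://example.org/>\n"
        "SELECT ?title WHERE {\n"
        "  ?alice ex:name \"Alice\" .\n"
        "  ?alice ex:follows ?user .\n"
        "  ?user ex:liked ?post .\n"
        "  ?post ex:title ?title .\n"
        "}"
    ),
}

# HTTP paths on the cluster endpoint (port 8182) that accept each language.
ENDPOINT_PATHS = {
    "gremlin": "/gremlin",
    "opencypher": "/openCypher",
    "sparql": "/sparql",
}

def describe(language):
    """Return a one-line summary of where a query in this language would be sent."""
    return f"{language}: POST {ENDPOINT_PATHS[language]} ({len(QUERIES[language])} chars)"
```

Gremlin spells out the steps of the walk, openCypher draws the pattern to match, and SPARQL lists the triple patterns that must all hold.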
Architecture
Neptune uses the same distributed storage architecture as Aurora — six copies across three Availability Zones with quorum-based writes.
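A rough sketch of what quorum-based writes mean for availability. The six copies come from the text; the 4-of-6 write quorum is an assumption borrowed from Aurora's published storage design, and the function and AZ names are hypothetical.

```python
COPIES = 6        # from the text: six copies across three Availability Zones
WRITE_QUORUM = 4  # assumption: the 4-of-6 write quorum described for Aurora storage

def write_succeeds(acks_by_az):
    """A write commits once at least WRITE_QUORUM copies acknowledge it."""
    return sum(acks_by_az.values()) >= WRITE_QUORUM

# Number of copies in each AZ that acknowledged the write.
all_healthy = {"az-a": 2, "az-b": 2, "az-c": 2}
one_az_down = {"az-a": 2, "az-b": 2, "az-c": 0}
az_plus_one_copy_down = {"az-a": 2, "az-b": 1, "az-c": 0}
```

Under that assumption, losing an entire AZ still leaves four copies, so writes continue; losing an AZ plus one more copy drops below the quorum.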
Neptune Cluster Architecture
==============================

Application
    │
    ├──► Writer endpoint → Primary instance
    └──► Reader endpoint → Read replicas (up to 15)

Primary + Replicas
    │
┌───▼────────────────────────────────────────────┐
│ Neptune Storage Volume (6 copies, 3 AZs)       │
│ Stores both vertices/edges and their indexes   │
│ Auto-scales, no capacity management            │
└────────────────────────────────────────────────┘

Failover to replica: typically < 30 seconds
Storage: up to 64 TB per cluster

Key Use Cases
Social network analysis: model users as vertices, friendships/follows as edges. Queries that would require expensive recursive SQL become simple graph traversals. Finding all connections within three hops, detecting communities, or computing influence scores are natural graph operations.
Fraud detection: financial transaction graphs reveal patterns invisible in tabular data. When account A sends money to account B, and B sends to C, and C sends back to A — that circular flow is suspicious. A graph query detects this in milliseconds; equivalent SQL requires complex recursive CTEs.
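The circular-flow check can be sketched in plain Python as a fixed-length walk that must end where it started. The transfers are hypothetical sample data (including the amount on the closing transfer), and `cycles_from` is an illustrative helper, not a Neptune API.

```python
# Hypothetical transfers: (sender, receiver, amount).
transfers = [
    ("Alice", "Bob", 100),
    ("Bob", "Store", 50),
    ("Alice", "Shell-Co-1", 1000),
    ("Shell-Co-1", "Shell-Co-2", 950),
    ("Shell-Co-2", "Alice", 900),
]

# Outgoing SENT edges per account.
sent = {}
for sender, receiver, _amount in transfers:
    sent.setdefault(sender, set()).add(receiver)

def cycles_from(start, hops=3):
    """Return every path of exactly `hops` SENT edges that ends back at `start`."""
    paths = [[start]]
    for _ in range(hops):
        paths = [
            path + [nxt]
            for path in paths
            for nxt in sorted(sent.get(path[-1], ()))
        ]
    return [path for path in paths if path[-1] == start]

# Accounts that sit on at least one three-hop cycle.
suspicious = sorted(acct for acct in sent if cycles_from(acct))
```

The walk only expands from accounts actually reached, which is why this kind of check stays cheap on a graph even when the full transaction history is large.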
Fraud Detection Graph Pattern
==============================

Normal transaction pattern:
Alice → [sends $100] → Bob → [sends $50] → Store

Suspicious circular pattern:
Alice → [sends $1000] → Shell-Co-1 → [sends $950]
  ↑                                        │
  └──────────── Shell-Co-2 ◄───────────────┘

Gremlin query to find cycles starting from suspicious accounts:
g.V().has('type', 'account').as('start')
  .out('SENT').out('SENT').out('SENT')
  .where(eq('start'))
  .path()

Recommendation engines: “customers who bought X also bought Y” is a graph traversal. Walk from a product vertex, through PURCHASED edges to customer vertices, then through other PURCHASED edges to other products, aggregating by frequency.
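The recommendation walk (product to customers to other products, aggregated by frequency) can be sketched in a few lines of Python. The purchase data and the `also_bought` helper are hypothetical.

```python
from collections import Counter

# Hypothetical PURCHASED edges, grouped by customer.
purchases = {
    "c1": {"X", "Y", "Z"},
    "c2": {"X", "Y"},
    "c3": {"X", "W"},
    "c4": {"Y", "W"},
}

def also_bought(product):
    """Count other products bought by customers who bought `product`."""
    counts = Counter()
    for basket in purchases.values():
        if product in basket:                     # hop 1: product -> customer
            counts.update(basket - {product})     # hop 2: customer -> other products
    return counts.most_common()                   # most frequent first

recommendations = also_bought("X")
```

Customer c4 never bought X, so their basket does not contribute; Y ranks first because two of the three X buyers also bought it.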
Knowledge graphs: enterprise knowledge graphs connect people, organizations, technologies, projects, and documents. Pharmaceutical companies use Neptune to connect drugs, proteins, diseases, genes, and research papers — enabling discovery queries that are impractical in relational databases.
Network topology: AWS infrastructure dependencies, application service graphs, and IT network topologies are naturally modeled as graphs. Root cause analysis — “which service failure could have caused this set of symptoms” — is a graph reachability problem.
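Root cause analysis as reachability can be sketched like this: collect everything each failing service depends on, then intersect. The service names and both helpers are hypothetical.

```python
from collections import deque

# Hypothetical dependency edges: service -> services it calls.
depends_on = {
    "web": ["api"],
    "api": ["auth", "db"],
    "reports": ["db"],
    "auth": ["cache"],
}

def dependencies_of(service):
    """All services reachable by following depends_on edges (breadth-first)."""
    seen, queue = set(), deque([service])
    while queue:
        for dep in depends_on.get(queue.popleft(), []):
            if dep not in seen:
                seen.add(dep)
                queue.append(dep)
    return seen

def candidate_root_causes(failing):
    """Services that every failing service is, or transitively depends on."""
    reachable = [dependencies_of(s) | {s} for s in failing]
    return sorted(set.intersection(*reachable))

candidates = candidate_root_causes(["web", "reports"])
```

Here "web" and "reports" share only the database in their dependency sets, so it is the single candidate worth checking first.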
Neptune Streams
Neptune Streams, once enabled on a cluster, captures every change to graph data as an ordered stream of records. Every create, update, or delete of a vertex, edge, or property is recorded in sequence. Applications read from the stream to update downstream search indexes, trigger alerts when suspicious connections appear, or replicate graph data to other stores. Records are retained for 7 days by default.
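A consumer of an ordered change stream usually keeps a checkpoint so it can resume, or safely replay, after a restart. The sketch below shows that pattern in plain Python. The record fields (`commit_num`, `op_num`, `op`, `id`, `value`) are a simplified, hypothetical shape, not the actual Neptune Streams response format.

```python
def apply_changes(records, index, checkpoint=(0, 0)):
    """Apply records newer than `checkpoint` in order; return the new checkpoint."""
    ordered = sorted(records, key=lambda r: (r["commit_num"], r["op_num"]))
    for rec in ordered:
        position = (rec["commit_num"], rec["op_num"])
        if position <= checkpoint:
            continue  # already processed, so replays are harmless
        if rec["op"] == "ADD":
            index[rec["id"]] = rec["value"]
        elif rec["op"] == "REMOVE":
            index.pop(rec["id"], None)
        checkpoint = position
    return checkpoint

# Hypothetical change records, in simplified form.
records = [
    {"commit_num": 1, "op_num": 1, "op": "ADD", "id": "u1001", "value": "Alice"},
    {"commit_num": 1, "op_num": 2, "op": "ADD", "id": "u2002", "value": "Bob"},
    {"commit_num": 2, "op_num": 1, "op": "REMOVE", "id": "u2002", "value": "Bob"},
]

search_index = {}
checkpoint = apply_changes(records, search_index)
# Replaying the same batch from the saved checkpoint changes nothing.
replayed = apply_changes(records, search_index, checkpoint)
```

Persisting the checkpoint alongside the downstream store is what lets a consumer pick up where it left off within the retention window.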
Neptune Serverless
Neptune Serverless scales compute capacity automatically in Neptune Capacity Units (NCUs) based on actual query load, between a minimum and maximum that you configure. When idle it drops to the configured minimum rather than to zero. Appropriate for development environments, applications with unpredictable query traffic, and use cases where the graph is read infrequently.
Real-World Use Case: Drug Discovery Knowledge Graph
A pharmaceutical company builds a knowledge graph connecting drugs, proteins, diseases, genes, and published research. The graph has 500 million nodes and 2 billion edges. A researcher queries: “Find all proteins associated with Alzheimer’s disease that are inhibited by approved drugs, but have not been studied in combination therapy trials.”
This query traverses multiple relationship types across billions of edges, and Neptune can return results in seconds. An equivalent relational query would require joins across six tables with billions of rows; it could be written in SQL, but it would be hard to express and could take hours to run.
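The shape of the researcher's query can be sketched on a toy graph. All identifiers, relationship names, and the `approved` set are hypothetical; a real knowledge graph would define its own schema.

```python
# Hypothetical edges: (source, relationship, target).
edges = [
    ("protein:P1", "ASSOCIATED_WITH", "disease:alzheimers"),
    ("protein:P2", "ASSOCIATED_WITH", "disease:alzheimers"),
    ("protein:P3", "ASSOCIATED_WITH", "disease:alzheimers"),
    ("drug:D1", "INHIBITS", "protein:P1"),
    ("drug:D2", "INHIBITS", "protein:P2"),
    ("drug:D3", "INHIBITS", "protein:P3"),
    ("trial:T1", "STUDIES_COMBINATION", "protein:P2"),
]
approved = {"drug:D1", "drug:D2"}  # hypothetical approval status

def sources(relationship, target):
    """Vertices with an edge of the given type pointing at `target`."""
    return {s for s, r, t in edges if r == relationship and t == target}

def unexplored_targets(disease):
    """Proteins tied to the disease, inhibited by an approved drug, with no combination trial."""
    results = []
    for protein in sorted(sources("ASSOCIATED_WITH", disease)):
        inhibited_by_approved = sources("INHIBITS", protein) & approved
        in_combination_trial = bool(sources("STUDIES_COMBINATION", protein))
        if inhibited_by_approved and not in_combination_trial:
            results.append(protein)
    return results

hits = unexplored_targets("disease:alzheimers")
```

P2 is excluded because a combination trial already covers it, and P3 because its only inhibitor is not approved, which leaves P1.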
Key Interview Points
- Neptune supports three query languages: Gremlin (property graph traversal), OpenCypher (declarative property graph), and SPARQL (RDF) — each suited to different use cases
- A graph database is not a replacement for a relational one: use Neptune when relationships between entities are central to your queries; use RDS/Aurora for tabular, structured data
- Neptune does not support SQL queries: if your workload is SQL-based, RDS, Aurora, or Redshift is the better fit
- Bulk loading: use the Neptune Loader to ingest large CSV or RDF files from S3; much faster than individual CRUD operations for initial data population
- IAM authentication: Neptune supports IAM-based authentication, with requests signed using AWS Signature Version 4, eliminating the need to manage separate database credentials
- Gremlin vs OpenCypher: if your team has Neo4j experience, OpenCypher will be more familiar; SPARQL is specifically for RDF/semantic web use cases
- Storage auto-scales up to 64 TB per cluster with no manual capacity management
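As an illustration of the bulk-loading point, the sketch below builds property-graph CSV files in the column convention the Neptune Loader expects (system columns prefixed with `~`, typed property columns) and the JSON body of a load request. The bucket name, role ARN, and region are placeholders, and nothing is sent; treat the parameter set as a starting point to check against the Loader documentation.

```python
import csv
import io
import json

def to_csv(header, rows):
    """Serialize rows to CSV text with the given header line."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buf.getvalue()

# Vertex file: ~id and ~label are system columns; others are typed properties.
vertex_csv = to_csv(
    ["~id", "~label", "name:String", "city:String"],
    [["u1001", "User", "Alice", "NYC"], ["u2002", "User", "Bob", "Boston"]],
)

# Edge file: ~from and ~to reference vertex ids.
edge_csv = to_csv(
    ["~id", "~from", "~to", "~label", "since:Date"],
    [["e1", "u1001", "u2002", "FOLLOWS", "2024-01-15"]],
)

# Body for a POST to the cluster's /loader endpoint (placeholder values).
load_request = json.dumps({
    "source": "s3://example-bucket/graph-data/",                        # hypothetical bucket
    "format": "csv",
    "iamRoleArn": "arn:aws:iam::123456789012:role/ExampleLoaderRole",   # hypothetical role
    "region": "us-east-1",
    "failOnError": "FALSE",
})
```

The files would be uploaded to the S3 prefix named in `source` before the load request is submitted.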