AWS Redshift Architecture: How MPP and Columnar Storage Deliver Analytical Speed

A query that asks “what was our total revenue by product category in Q3?” is simple to write. Running it over a billion rows in under a second requires a database architecture specifically designed for that pattern — not a general-purpose row store. Amazon Redshift was built from the ground up for analytical workloads, and understanding its architecture explains why it is fast where row stores are slow.

The Three Foundational Principles

Before looking at components, three architectural decisions explain almost every performance characteristic of Redshift.

Massively Parallel Processing (MPP)

Redshift distributes data and query execution across multiple nodes. When a query runs, the leader node breaks it into subtasks and dispatches them to all compute nodes simultaneously. Each node processes its local slice of the data and returns partial results. The leader node aggregates them into the final answer.

Query: SELECT SUM(revenue) FROM sales WHERE region = 'West'

Leader Node
  |-- Parses the query
  |-- Builds an execution plan
  |-- Dispatches subtasks to all compute nodes
  |
  +-- Compute Node 1: SUM(revenue) WHERE region='West' on slices 0-3
  +-- Compute Node 2: SUM(revenue) WHERE region='West' on slices 4-7
  +-- Compute Node 3: SUM(revenue) WHERE region='West' on slices 8-11
  |
  |-- Each node returns its partial sum
  |-- Leader aggregates: 45,231 + 67,890 + 38,441 = 151,562
  |
Client receives: 151,562

Adding nodes scales both storage capacity and query throughput roughly linearly.

Columnar Storage

Traditional row-oriented databases store each row contiguously on disk. To sum a single column across a million rows, the database reads all the data for every row — most of which is irrelevant to the query.

Redshift stores each column separately. To sum revenue, it reads only the revenue column file — nothing else. This reduces I/O by a large factor for typical analytical queries that touch 3-5 columns out of tables with 30-50 columns.

Row-oriented storage (traditional):
|sale_id|date      |customer|product   |qty|price|region|
|1001   |2024-01-01|C-5821  |Widget A  |3  |29.99|West  |
|1002   |2024-01-01|C-3341  |Widget B  |1  |49.99|East  |
... (must read entire row to get price column)

Column-oriented storage (Redshift):
price column file:  [29.99, 49.99, 15.99, 89.99, ...]
region column file: [West, East, West, North, ...]
... (to SUM price WHERE region='West', read ONLY these two files)

Additionally, data within a column is of the same type and often has low cardinality or follows a pattern. This compresses far better than mixed-type row data. Redshift uses encoding schemes like delta encoding for timestamps, run-length encoding for low-cardinality strings, and byte-dictionary encoding for repeated values.

Separation of Compute and Storage (RA3 Nodes)

RA3 nodes use local SSDs as a high-performance cache and Amazon S3 as the persistent backing store. Only frequently accessed (“hot”) data sits on local SSDs. Cold data lives on S3 and is fetched on demand.

This means:

Storage scales independently from compute
You can resize the cluster (add/remove compute nodes) without moving data
You pay S3 rates for storage, which is much cheaper than provisioned SSD

Cluster Architecture

                Client (BI Tool, SQL Client)
                          |
                    [Leader Node]
                   /      |      \
                  /       |       \
         [Compute  [Compute  [Compute
          Node 1]   Node 2]   Node N]
          /   \      /   \      /   \
       Slice Slice  Slice Slice Slice Slice
       0     1      2     3     4     5
         |              |              |
    Local SSD      Local SSD      Local SSD
    (hot cache)    (hot cache)    (hot cache)
              \         |         /
               [Amazon S3 - backing store (RA3)]

Leader Node

The leader node handles client connections and query planning. It does not store user data. When a query arrives:

Parse the SQL
Build a query execution plan — which nodes do what, in what order, how data moves between nodes
Compile the plan into code
Distribute compiled code to compute nodes
Wait for results from all compute nodes
Aggregate partial results and return the final answer to the client

The leader node also manages metadata: table definitions, statistics, access control, and query queue state.

Compute Nodes

Each compute node processes queries on its assigned data. Nodes run in parallel and do not share data with each other except during query execution when the leader node orchestrates data redistribution for joins.

Each compute node is divided into slices. A slice is an independent processing unit with its own CPU, memory, and disk partition. The number of slices depends on the node type — dc2.large has 2 slices, ra3.4xlarge has 4 slices, ra3.16xlarge has 16 slices.

Data is distributed at the slice level, not the node level.

Data Distribution Styles

How you distribute data across slices is the single biggest architectural decision in a Redshift design. It determines whether joins require data movement between nodes (slow) or happen locally on each slice (fast).

KEY distribution: You designate one column as the distribution key. Rows with the same key value land on the same slice. Use this for large fact tables where you join frequently on a specific column.

orders table (DISTKEY: customer_id)
  Slice 0: customer_id 1, 5, 9, 13, ...
  Slice 1: customer_id 2, 6, 10, 14, ...
  Slice 2: customer_id 3, 7, 11, 15, ...
  Slice 3: customer_id 4, 8, 12, 16, ...

customers table (DISTKEY: customer_id)
  Same distribution -- matching rows co-located on same slice

JOIN orders ON customers.customer_id = orders.customer_id
  --> Local join on each slice -- no data movement -- fast

ALL distribution: A full copy of the table exists on every compute node. Use for small dimension tables (fewer than a few million rows). Any join to this table is always a local join — the dimension data is always co-located with the fact data.

EVEN distribution: Data is distributed round-robin. Simple and balanced, but joins between EVEN-distributed tables require broadcasting or redistributing data. Use for staging tables that do not participate in joins.

AUTO distribution: Redshift manages the style. Small tables get ALL, larger ones get EVEN. Redshift may reassign the style as the table grows. Convenient for beginners; less predictable for performance-critical tables.

Sort Keys

Sort keys define the physical order of data on disk within each slice. Redshift maintains zone maps — metadata recording the minimum and maximum sort key value in each 1 MB data block.

When a query filters on the sort key column, Redshift checks zone maps and skips entire blocks that cannot contain matching rows. On a well-sorted table with a date sort key, a query for a single day may skip 99% of the data.

sales table, SORTKEY(sale_date)

Block 1: min=2024-01-01, max=2024-01-31
Block 2: min=2024-02-01, max=2024-02-28
Block 3: min=2024-03-01, max=2024-03-31
...

Query: WHERE sale_date = '2024-02-15'
  --> Block 1: max < 2024-02-15? No. Skip.
  Wait -- min=2024-01-01, max=2024-01-31 < 2024-02-15. Skip.
  --> Block 2: 2024-02-15 is between min and max. Read.
  --> Block 3: min > 2024-02-15. Skip.
  Result: only Block 2 is read.

Compound sort key: Sorts by multiple columns in order — first by year, then month, then day. Efficient for range queries on the leading column.

Interleaved sort key: Gives equal weight to each column. Useful when queries filter on any one of several columns without a preferred leading column. Higher maintenance cost — vacuuming takes longer.

The VACUUM and ANALYZE Operations

VACUUM: Reclaims space from deleted rows and re-sorts data that was inserted out of order. Redshift does not delete rows physically on DELETE — it marks them as deleted and skips them during scans. VACUUM removes those markers and compacts the table.

Auto VACUUM runs in the background automatically. For large tables with heavy churn, you may need to run VACUUM manually or tune the Auto VACUUM settings.

ANALYZE: Updates statistics on column distribution, which the query planner uses to choose join order and execution strategy. Run ANALYZE after large bulk loads. Auto ANALYZE runs automatically but can lag behind for fast-changing tables.

Interview Notes

Q: What is the leader node’s role? Does it store data? The leader node manages client connections, parses queries, builds execution plans, and aggregates results. It does not store user data. It is single-instance but AWS manages its availability.

Q: What is a slice in Redshift? A slice is a partition within a compute node, with its own CPU, memory, and disk. Data is distributed at the slice level. A node with 4 slices runs 4 parallel data processing streams. The number of slices determines the maximum parallelism within a cluster.

Q: Why is columnar storage better for analytics but worse for OLTP? Analytical queries read many rows but few columns — columnar is fast because it skips irrelevant columns. OLTP queries read or update entire rows frequently — row storage is faster because the full row is in one place on disk. A single-row lookup in Redshift reads an entire block per column, which is wasteful for transactional access patterns.

Q: What happens if you choose a poor distribution key? Data skew: some slices hold far more rows than others. One slice becomes the bottleneck — it processes disproportionately more work while others sit idle. Overall query time is limited by the slowest slice. Signs of skew: high skew_rows in SVV_TABLE_INFO or uneven block counts across slices.