Cloud  /  AWS

AWS Amazon Web Services 61 guides · updated 2026

Hands-on guides to compute, storage, databases, networking, and serverless on the world's most widely adopted cloud platform.

Amazon Redshift: Column-Store Data Warehousing at Petabyte Scale

Operational databases like RDS and DynamoDB are optimized for transactional queries — insert a row, look up a record by primary key, update a specific field. Analytics queries look completely different: sum all sales from the past six months, calculate average order value by customer segment across 500 million orders, compare year-over-year revenue by product category. These queries touch enormous amounts of data and need completely different storage and execution architectures.

Amazon Redshift is built specifically for analytics. Its columnar storage, massively parallel processing, and deep integration with the AWS data ecosystem make it the right tool when you need to run SQL on large datasets — from hundreds of gigabytes to multiple petabytes.

Why Columnar Storage Changes Everything for Analytics

A traditional row-store database stores each row contiguously on disk. When you query a table with 50 columns but only need 3 of them, the storage layer still reads all 50 columns for every matching row. For analytics queries that scan millions or billions of rows, reading irrelevant columns is a significant waste.

Columnar storage stores each column separately. A query that reads total_amount, order_date, and customer_segment from a 50-column table reads only 3/50 of the data. This dramatically reduces I/O for typical analytics queries.

Row Store vs Column Store
==========================
Row Store (RDS/Aurora/transactional DB):
Disk: [id=1, name=Alice, age=29, country=US, amount=99.99, ...]
Disk: [id=2, name=Bob, age=34, country=CA, amount=245.00, ...]
Query: SELECT SUM(amount) FROM orders WHERE country='US'
Must read entire rows including name, age, and 45 other columns
Column Store (Redshift):
Disk (amount column): [99.99, 245.00, 12.50, 500.00, ...]
Disk (country column): [US, CA, US, US, ...]
Query: SELECT SUM(amount) FROM orders WHERE country='US'
Reads only 2 columns out of 50
Compression per column is also more effective (similar values together)

Column data also compresses dramatically better because adjacent values in a column are often similar (e.g., a country column might have thousands of repeated “US” values). Redshift automatically selects compression encodings per column during COPY operations or with the ANALYZE COMPRESSION command.

Massively Parallel Processing (MPP)

A Redshift cluster consists of a leader node and multiple compute nodes. The leader node receives queries, generates a query plan, distributes work to compute nodes, and aggregates results. Each compute node processes a subset of the data in parallel.

Redshift MPP Architecture
==========================
SQL Client / BI Tool
Leader Node (query planning, result aggregation)
┌────┼────┐
▼ ▼ ▼
Node1 Node2 Node3 (each processes its slice of data)
│ │ │
Disk Disk Disk (each node holds a portion of each table)
A query on a 10 TB table with 10 nodes:
each node processes ~1 TB in parallel
total elapsed time ≈ time for 1 TB query

The number and type of compute nodes determines the cluster’s performance and cost. RA3 nodes (the current generation) use managed storage that separates compute and storage — you scale compute and storage independently, and data is cached locally on the compute nodes with overflow to S3.

Data Distribution Styles

How Redshift distributes rows across compute nodes significantly impacts query performance. Poor distribution causes data skew (some nodes have much more data than others) and excessive data movement between nodes during joins.

EVEN: rows are distributed in round-robin fashion across nodes. Good default for tables without obvious join patterns.

KEY: rows with the same value in the distribution column go to the same node. When two tables are joined on the distribution key, the joined rows are already co-located, eliminating network data movement.

ALL: a full copy of the entire table is placed on every node. Use this for small dimension tables (under a few million rows) that are joined frequently — no matter which node the fact table data is on, the dimension table is already there.

AUTO: Redshift chooses the distribution style based on table size, changing automatically as the table grows.

Sort Keys

Sort keys define the physical order of data on disk within a table. Queries that filter on the sort key benefit from the zone maps — metadata that tracks the minimum and maximum values per 1 MB disk block, letting Redshift skip entire blocks without reading them.

Sort Key Benefit
=================
Table: sales, sorted by order_date
Block 1: order_date 2024-01-01 to 2024-01-15 (min/max in zone map)
Block 2: order_date 2024-01-16 to 2024-01-31
Block 3: order_date 2024-02-01 to 2024-02-28
...
Block 20: order_date 2024-10-01 to 2024-10-31
Query: WHERE order_date BETWEEN '2024-10-01' AND '2024-10-31'
Redshift reads only Block 20 — skips 19 blocks entirely
Compound sort key: (year, month, order_date) for hierarchical filtering
Interleaved sort key: equal weight on multiple columns for varied filter patterns

Redshift Serverless

Redshift Serverless removes the need to size and manage clusters. You configure a capacity measured in Redshift Processing Units (RPUs) with a maximum to control costs. The service scales automatically based on workload, charges per second of compute used, and scales to zero when idle.

This is appropriate for:

For continuous, heavy analytics workloads, provisioned clusters with reserved instance pricing are typically more cost-effective.

Redshift Spectrum

Spectrum extends Redshift queries to data stored in S3 — without loading the data into Redshift first. Using an external schema pointing to an AWS Glue Data Catalog database, you can join Redshift tables with S3 data as if it were all in the same database.

Redshift Spectrum Architecture
================================
SELECT r.customer_id, s.event_type, r.total_spent
FROM redshift_table r
JOIN spectrum_schema.clickstream s ON r.customer_id = s.user_id
WHERE s.event_date >= '2025-06-01'
Execution:
├── redshift_table → read from Redshift cluster (local)
└── clickstream → Spectrum reads from S3 (Parquet files in data lake)
Spectrum nodes run in parallel, push-down filters to S3
Only matching rows returned to Redshift leader
No ETL required to query S3 data

This enables a tiered data architecture: recent, frequently queried data loaded into Redshift; historical data kept in S3 in a cost-effective format like Parquet, queried via Spectrum when needed.

Real-World Use Case: Retail Analytics Platform

A retailer processes 50 million transactions per day. The analytics team runs:

Architecture:

Key Interview Points