Amazon Redshift: Column-Store Data Warehousing at Petabyte Scale

Operational databases like RDS and DynamoDB are optimized for transactional queries — insert a row, look up a record by primary key, update a specific field. Analytics queries look completely different: sum all sales from the past six months, calculate average order value by customer segment across 500 million orders, compare year-over-year revenue by product category. These queries touch enormous amounts of data and need completely different storage and execution architectures.

Amazon Redshift is built specifically for analytics. Its columnar storage, massively parallel processing, and deep integration with the AWS data ecosystem make it the right tool when you need to run SQL on large datasets — from hundreds of gigabytes to multiple petabytes.

Why Columnar Storage Changes Everything for Analytics

A traditional row-store database stores each row contiguously on disk. When you query a table with 50 columns but only need 3 of them, the storage layer still reads all 50 columns for every matching row. For analytics queries that scan millions or billions of rows, reading irrelevant columns is a significant waste.

Columnar storage stores each column separately. A query that reads total_amount, order_date, and customer_segment from a 50-column table reads only 3/50 of the data. This dramatically reduces I/O for typical analytics queries.

  Row Store vs Column Store
  ==========================

  Row Store (RDS/Aurora/transactional DB):
  Disk: [id=1, name=Alice, age=29, country=US, amount=99.99, ...]
  Disk: [id=2, name=Bob,   age=34, country=CA, amount=245.00, ...]

  Query: SELECT SUM(amount) FROM orders WHERE country='US'
  Must read entire rows including name, age, and 45 other columns

  Column Store (Redshift):
  Disk (amount column):   [99.99, 245.00, 12.50, 500.00, ...]
  Disk (country column):  [US, CA, US, US, ...]

  Query: SELECT SUM(amount) FROM orders WHERE country='US'
  Reads only 2 columns out of 50
  Compression per column is also more effective (similar values together)

Column data also compresses dramatically better because adjacent values in a column are often similar (e.g., a country column might have thousands of repeated “US” values). Redshift automatically selects compression encodings per column during COPY operations or with the ANALYZE COMPRESSION command.

Massively Parallel Processing (MPP)

A Redshift cluster consists of a leader node and multiple compute nodes. The leader node receives queries, generates a query plan, distributes work to compute nodes, and aggregates results. Each compute node processes a subset of the data in parallel.

  Redshift MPP Architecture
  ==========================

  SQL Client / BI Tool
         │
         ▼
  Leader Node (query planning, result aggregation)
         │
    ┌────┼────┐
    ▼    ▼    ▼
  Node1 Node2 Node3 (each processes its slice of data)
   │     │     │
  Disk  Disk  Disk  (each node holds a portion of each table)

  A query on a 10 TB table with 10 nodes:
  each node processes ~1 TB in parallel
  total elapsed time ≈ time for 1 TB query

The number and type of compute nodes determines the cluster’s performance and cost. RA3 nodes (the current generation) use managed storage that separates compute and storage — you scale compute and storage independently, and data is cached locally on the compute nodes with overflow to S3.

Data Distribution Styles

How Redshift distributes rows across compute nodes significantly impacts query performance. Poor distribution causes data skew (some nodes have much more data than others) and excessive data movement between nodes during joins.

EVEN: rows are distributed in round-robin fashion across nodes. Good default for tables without obvious join patterns.

KEY: rows with the same value in the distribution column go to the same node. When two tables are joined on the distribution key, the joined rows are already co-located, eliminating network data movement.

ALL: a full copy of the entire table is placed on every node. Use this for small dimension tables (under a few million rows) that are joined frequently — no matter which node the fact table data is on, the dimension table is already there.

AUTO: Redshift chooses the distribution style based on table size, changing automatically as the table grows.

Sort Keys

Sort keys define the physical order of data on disk within a table. Queries that filter on the sort key benefit from the zone maps — metadata that tracks the minimum and maximum values per 1 MB disk block, letting Redshift skip entire blocks without reading them.

  Sort Key Benefit
  =================

  Table: sales, sorted by order_date

  Block 1: order_date 2024-01-01 to 2024-01-15  (min/max in zone map)
  Block 2: order_date 2024-01-16 to 2024-01-31
  Block 3: order_date 2024-02-01 to 2024-02-28
  ...
  Block 20: order_date 2024-10-01 to 2024-10-31

  Query: WHERE order_date BETWEEN '2024-10-01' AND '2024-10-31'
  Redshift reads only Block 20 — skips 19 blocks entirely

  Compound sort key: (year, month, order_date) for hierarchical filtering
  Interleaved sort key: equal weight on multiple columns for varied filter patterns

Redshift Serverless

Redshift Serverless removes the need to size and manage clusters. You configure a capacity measured in Redshift Processing Units (RPUs) with a maximum to control costs. The service scales automatically based on workload, charges per second of compute used, and scales to zero when idle.

This is appropriate for:

Intermittent analytics workloads that run for a few hours each day
Development and testing environments
Small-to-medium data warehouses where the operational overhead of managing clusters is not warranted

For continuous, heavy analytics workloads, provisioned clusters with reserved instance pricing are typically more cost-effective.

Redshift Spectrum

Spectrum extends Redshift queries to data stored in S3 — without loading the data into Redshift first. Using an external schema pointing to an AWS Glue Data Catalog database, you can join Redshift tables with S3 data as if it were all in the same database.

  Redshift Spectrum Architecture
  ================================

  SELECT r.customer_id, s.event_type, r.total_spent
  FROM redshift_table r
  JOIN spectrum_schema.clickstream s ON r.customer_id = s.user_id
  WHERE s.event_date >= '2025-06-01'

  Execution:
  ├── redshift_table → read from Redshift cluster (local)
  └── clickstream → Spectrum reads from S3 (Parquet files in data lake)
                    Spectrum nodes run in parallel, push-down filters to S3
                    Only matching rows returned to Redshift leader

  No ETL required to query S3 data

This enables a tiered data architecture: recent, frequently queried data loaded into Redshift; historical data kept in S3 in a cost-effective format like Parquet, queried via Spectrum when needed.

Real-World Use Case: Retail Analytics Platform

A retailer processes 50 million transactions per day. The analytics team runs:

Hourly sales reports broken down by store, product, and region
Customer cohort analysis comparing spending patterns across acquisition channels
Inventory forecasting using 3 years of historical sales data

Architecture:

Transactions loaded from RDS via AWS Glue ETL jobs into Redshift hourly
sales table distributed by store_id (largest join dimension), sorted by transaction_date
Small dimension tables (products, stores, regions) use ALL distribution for co-located joins
3+ year old data moved to S3 in Parquet format, queried via Spectrum when needed for historical analysis
Tableau connects to Redshift for BI dashboards; result caching serves repeated identical dashboard queries from memory without rerunning SQL

Key Interview Points

Redshift is OLAP, not OLTP — never use it for operational transaction processing; use RDS for that
COPY command is the fastest way to load data — it reads in parallel from S3, DynamoDB, or other sources; individual INSERT statements are slow in Redshift
VACUUM command reclaims space from deleted rows and re-sorts data after updates — run regularly or use automatic vacuum settings
ANALYZE command updates statistics that the query planner uses — run after significant data loads
Concurrency Scaling adds temporary cluster capacity during query spikes, charged per second of use
WLM (Workload Management) defines query queues with priority levels — prevents long analytical queries from starving short dashboard queries
Redshift does not support row-level security natively — implement it via views or column-level security controls
Compression with columnar encoding reduces storage by 3-10x and speeds up queries by reducing I/O — always encode columns appropriately