Amazon Redshift: Column-Store Data Warehousing at Petabyte Scale
Operational databases like RDS and DynamoDB are optimized for transactional queries — insert a row, look up a record by primary key, update a specific field. Analytics queries look completely different: sum all sales from the past six months, calculate average order value by customer segment across 500 million orders, compare year-over-year revenue by product category. These queries touch enormous amounts of data and need completely different storage and execution architectures.
Amazon Redshift is built specifically for analytics. Its columnar storage, massively parallel processing, and deep integration with the AWS data ecosystem make it the right tool when you need to run SQL on large datasets — from hundreds of gigabytes to multiple petabytes.
Why Columnar Storage Changes Everything for Analytics
A traditional row-store database stores each row contiguously on disk. When you query a table with 50 columns but only need 3 of them, the storage layer still reads all 50 columns for every matching row. For analytics queries that scan millions or billions of rows, reading irrelevant columns is a significant waste.
Columnar storage stores each column separately. A query that reads total_amount, order_date, and customer_segment from a 50-column table reads only 3/50 of the data. This dramatically reduces I/O for typical analytics queries.
Row Store vs Column Store ==========================
Row Store (RDS/Aurora/transactional DB): Disk: [id=1, name=Alice, age=29, country=US, amount=99.99, ...] Disk: [id=2, name=Bob, age=34, country=CA, amount=245.00, ...]
Query: SELECT SUM(amount) FROM orders WHERE country='US' Must read entire rows including name, age, and 45 other columns
Column Store (Redshift): Disk (amount column): [99.99, 245.00, 12.50, 500.00, ...] Disk (country column): [US, CA, US, US, ...]
Query: SELECT SUM(amount) FROM orders WHERE country='US' Reads only 2 columns out of 50 Compression per column is also more effective (similar values together)Column data also compresses dramatically better because adjacent values in a column are often similar (e.g., a country column might have thousands of repeated “US” values). Redshift automatically selects compression encodings per column during COPY operations or with the ANALYZE COMPRESSION command.
Massively Parallel Processing (MPP)
A Redshift cluster consists of a leader node and multiple compute nodes. The leader node receives queries, generates a query plan, distributes work to compute nodes, and aggregates results. Each compute node processes a subset of the data in parallel.
Redshift MPP Architecture ==========================
SQL Client / BI Tool │ ▼ Leader Node (query planning, result aggregation) │ ┌────┼────┐ ▼ ▼ ▼ Node1 Node2 Node3 (each processes its slice of data) │ │ │ Disk Disk Disk (each node holds a portion of each table)
A query on a 10 TB table with 10 nodes: each node processes ~1 TB in parallel total elapsed time ≈ time for 1 TB queryThe number and type of compute nodes determines the cluster’s performance and cost. RA3 nodes (the current generation) use managed storage that separates compute and storage — you scale compute and storage independently, and data is cached locally on the compute nodes with overflow to S3.
Data Distribution Styles
How Redshift distributes rows across compute nodes significantly impacts query performance. Poor distribution causes data skew (some nodes have much more data than others) and excessive data movement between nodes during joins.
EVEN: rows are distributed in round-robin fashion across nodes. Good default for tables without obvious join patterns.
KEY: rows with the same value in the distribution column go to the same node. When two tables are joined on the distribution key, the joined rows are already co-located, eliminating network data movement.
ALL: a full copy of the entire table is placed on every node. Use this for small dimension tables (under a few million rows) that are joined frequently — no matter which node the fact table data is on, the dimension table is already there.
AUTO: Redshift chooses the distribution style based on table size, changing automatically as the table grows.
Sort Keys
Sort keys define the physical order of data on disk within a table. Queries that filter on the sort key benefit from the zone maps — metadata that tracks the minimum and maximum values per 1 MB disk block, letting Redshift skip entire blocks without reading them.
Sort Key Benefit =================
Table: sales, sorted by order_date
Block 1: order_date 2024-01-01 to 2024-01-15 (min/max in zone map) Block 2: order_date 2024-01-16 to 2024-01-31 Block 3: order_date 2024-02-01 to 2024-02-28 ... Block 20: order_date 2024-10-01 to 2024-10-31
Query: WHERE order_date BETWEEN '2024-10-01' AND '2024-10-31' Redshift reads only Block 20 — skips 19 blocks entirely
Compound sort key: (year, month, order_date) for hierarchical filtering Interleaved sort key: equal weight on multiple columns for varied filter patternsRedshift Serverless
Redshift Serverless removes the need to size and manage clusters. You configure a capacity measured in Redshift Processing Units (RPUs) with a maximum to control costs. The service scales automatically based on workload, charges per second of compute used, and scales to zero when idle.
This is appropriate for:
- Intermittent analytics workloads that run for a few hours each day
- Development and testing environments
- Small-to-medium data warehouses where the operational overhead of managing clusters is not warranted
For continuous, heavy analytics workloads, provisioned clusters with reserved instance pricing are typically more cost-effective.
Redshift Spectrum
Spectrum extends Redshift queries to data stored in S3 — without loading the data into Redshift first. Using an external schema pointing to an AWS Glue Data Catalog database, you can join Redshift tables with S3 data as if it were all in the same database.
Redshift Spectrum Architecture ================================
SELECT r.customer_id, s.event_type, r.total_spent FROM redshift_table r JOIN spectrum_schema.clickstream s ON r.customer_id = s.user_id WHERE s.event_date >= '2025-06-01'
Execution: ├── redshift_table → read from Redshift cluster (local) └── clickstream → Spectrum reads from S3 (Parquet files in data lake) Spectrum nodes run in parallel, push-down filters to S3 Only matching rows returned to Redshift leader
No ETL required to query S3 dataThis enables a tiered data architecture: recent, frequently queried data loaded into Redshift; historical data kept in S3 in a cost-effective format like Parquet, queried via Spectrum when needed.
Real-World Use Case: Retail Analytics Platform
A retailer processes 50 million transactions per day. The analytics team runs:
- Hourly sales reports broken down by store, product, and region
- Customer cohort analysis comparing spending patterns across acquisition channels
- Inventory forecasting using 3 years of historical sales data
Architecture:
- Transactions loaded from RDS via AWS Glue ETL jobs into Redshift hourly
salestable distributed bystore_id(largest join dimension), sorted bytransaction_date- Small dimension tables (
products,stores,regions) use ALL distribution for co-located joins - 3+ year old data moved to S3 in Parquet format, queried via Spectrum when needed for historical analysis
- Tableau connects to Redshift for BI dashboards; result caching serves repeated identical dashboard queries from memory without rerunning SQL
Key Interview Points
- Redshift is OLAP, not OLTP — never use it for operational transaction processing; use RDS for that
- COPY command is the fastest way to load data — it reads in parallel from S3, DynamoDB, or other sources; individual INSERT statements are slow in Redshift
- VACUUM command reclaims space from deleted rows and re-sorts data after updates — run regularly or use automatic vacuum settings
- ANALYZE command updates statistics that the query planner uses — run after significant data loads
- Concurrency Scaling adds temporary cluster capacity during query spikes, charged per second of use
- WLM (Workload Management) defines query queues with priority levels — prevents long analytical queries from starving short dashboard queries
- Redshift does not support row-level security natively — implement it via views or column-level security controls
- Compression with columnar encoding reduces storage by 3-10x and speeds up queries by reducing I/O — always encode columns appropriately