AWS Redshift Interview Questions and Answers

Amazon Redshift is the most widely used cloud data warehouse. These questions cover architecture, performance, SQL optimization, and modern integrations — the topics that come up most in data engineering and analytics engineering interviews.

Fundamentals

Q1. What is Amazon Redshift and how does it differ from a traditional RDBMS?

Amazon Redshift is a fully managed, petabyte-scale cloud data warehouse optimized for analytical workloads (OLAP). Key differences from transactional databases:

Columnar storage — data stored by column, not row, so analytical queries that read a few columns out of many are dramatically faster
Massively parallel processing (MPP) — query execution distributed across compute nodes
Compression — automatic column-level compression reduces storage and speeds I/O
No row-level locking — designed for bulk loads and large scans, not point lookups or frequent updates
Leader + compute node architecture — the leader node parses SQL and distributes work; compute nodes execute slices of the query

Q2. Explain Redshift’s node architecture.

A Redshift cluster has:

Leader node — receives queries, builds execution plans, coordinates compute nodes, aggregates results
Compute nodes — store data and execute query slices. Each compute node divides its storage into slices (one per vCPU)
Node types — RA3 (managed storage, scale compute and storage independently), DC2 (dense compute, NVMe SSD, fixed storage per node)

RA3 is the recommended type for most workloads as of 2025 because you pay for compute and storage separately.

Q3. What is a distribution style and why does it matter?

Distribution style determines how rows are physically distributed across slices, directly affecting join and aggregation performance:

Style	How it works	Best for
`KEY`	Rows with same key value go to same slice	Large fact-dimension joins on that key
`ALL`	Full copy of table on every node	Small dimension tables (< 3M rows)
`EVEN`	Round-robin across slices	No dominant join key, balanced scans
`AUTO`	Redshift decides based on table size	Default; good starting point

Choosing KEY distribution for frequently joined large tables eliminates data redistribution (data movement) during query execution, which is one of the biggest performance wins.

Q4. What is a sort key and what are the different types?

Sort keys determine the physical order of data on disk. Redshift uses zone maps (min/max metadata per block) to skip blocks that can’t contain query-relevant rows.

Compound sort key — defines a hierarchy of columns. Effective only if queries filter on the leading column. Example: COMPOUND SORTKEY (event_date, user_id) — queries filtering on event_date benefit; queries filtering only on user_id don’t.

Interleaved sort key — weighs each column equally; effective for queries filtering on any combination of columns. Higher maintenance overhead (VACUUM is more expensive).

Best practice 2025: use compound sort keys with the most common filter column first. RA3 nodes with automatic table optimization (ATO) can recommend sort/distribution keys based on actual query patterns.

Q5. What is VACUUM in Redshift and when should you run it?

Redshift uses MVCC — deleted rows aren’t immediately removed; they’re marked as deleted but still on disk. VACUUM reclaims this space and re-sorts unsorted rows in the sort key order.

VACUUM FULL my_table;          -- Reclaim space AND resort
VACUUM SORT ONLY my_table;     -- Re-sort only (no reclaim)
VACUUM DELETE ONLY my_table;   -- Reclaim space only (faster)
VACUUM REINDEX my_table;       -- For interleaved sort keys

With RA3 nodes, Redshift runs automatic vacuuming in the background. Manual VACUUM is still useful after large bulk deletes or when performance degrades noticeably.

Performance Tuning

Q6. How do you diagnose a slow Redshift query?

Step-by-step approach:

-- 1. Find the slow query
SELECT query, elapsed, rows, label
FROM stl_query
WHERE userid > 1
ORDER BY elapsed DESC
LIMIT 10;

-- 2. Check the execution plan
EXPLAIN SELECT ...;

-- 3. Look for data movement (DS_BCAST_INNER, DS_DIST_ALL_NONE are expensive)
-- Check SVL_QUERY_REPORT for actual execution metrics
SELECT query, step, rows, bytes, label
FROM svl_query_report
WHERE query = <query_id>
ORDER BY step;

-- 4. Check for table skew (uneven data distribution)
SELECT slice, COUNT(*) FROM my_table GROUP BY slice ORDER BY slice;

Key warning signs in EXPLAIN output:

DS_BCAST_INNER — small table broadcast (acceptable)
DS_DIST_ALL_NONE — large table redistributed (expensive, fix with KEY distribution)
DS_DIST_BOTH — both tables redistributed (worst case)

Q7. What is Redshift Spectrum and when would you use it?

Redshift Spectrum extends query capability directly to data in Amazon S3 — without loading it into Redshift cluster storage. It uses external tables defined in the AWS Glue Data Catalog.

-- Create an external schema pointing to Glue catalog
CREATE EXTERNAL SCHEMA spectrum_schema
FROM DATA CATALOG
DATABASE 'my_glue_db'
IAM_ROLE 'arn:aws:iam::123456789:role/RedshiftSpectrumRole'
CREATE EXTERNAL DATABASE IF NOT EXISTS;

-- Query S3 data directly
SELECT s.event_date, COUNT(*) as events
FROM spectrum_schema.raw_events s
JOIN my_redshift_table r ON s.user_id = r.user_id
WHERE s.event_date >= '2025-01-01'
GROUP BY 1;

Use Spectrum for:

Historical/cold data that doesn’t justify cluster storage cost
Querying raw landing zone data before it’s processed into Redshift
Hybrid architectures (Redshift for hot data, S3 for cold data)

Q8. Explain WLM (Workload Management) in Redshift.

WLM controls how queries are queued and allocated memory/concurrency across different user groups or query types.

Automatic WLM (recommended) — Redshift allocates memory and concurrency automatically based on query complexity and system load. Default since 2019.

Manual WLM — you define up to 8 queues with explicit concurrency levels (number of simultaneous queries) and memory percentages.

-- Check queue assignment
SELECT query, queue_start_time, queue_end_time, exec_start_time
FROM stl_wlm_query
WHERE query = <id>;

Best practice: enable automatic WLM with concurrency scaling (burst capacity from additional clusters on-demand) for variable workloads.

Q9. What is the COPY command and why is it preferred for bulk loading?

COPY is Redshift’s bulk load mechanism — it reads in parallel from multiple S3 files across all compute node slices simultaneously.

COPY sales
FROM 's3://my-bucket/data/sales/'
IAM_ROLE 'arn:aws:iam::123456789:role/RedshiftRole'
FORMAT AS PARQUET;

-- With options
COPY orders
FROM 's3://my-bucket/orders/'
IAM_ROLE '...'
CSV
DELIMITER ','
IGNOREHEADER 1
DATEFORMAT 'YYYY-MM-DD'
MAXERROR 10;

Preferred over INSERT because:

Reads data in parallel (one file per slice for maximum throughput)
Applies compression automatically
Significantly faster — GB/s vs rows/s with INSERT

Best practice: split source files so there’s at least one file per node slice (e.g., 16 slices → at least 16 files).

Q10. What is Redshift Serverless and how does it differ from provisioned clusters?

Redshift Serverless (GA 2022) automatically provisions and scales compute capacity. You don’t manage nodes — you set a maximum RPU (Redshift Processing Units) limit.

Feature	Provisioned	Serverless
Billing	Per node-hour	Per RPU-second (while active)
Scaling	Manual or scheduled	Automatic
Concurrency scaling	Optional add-on	Built-in
Best for	Predictable, continuous workloads	Sporadic or unpredictable usage
Cold start	None	Seconds for first query

Serverless is often more cost-effective for development environments, ad-hoc analytics teams, and workloads with large idle periods.

SQL & Data Modeling

Q11. How does Redshift handle transactions and concurrency?

Redshift uses serializable isolation with MVCC (Multi-Version Concurrency Control). Reads don’t block writes and vice versa. However, Redshift does not support row-level locking — DDL statements (ALTER TABLE, TRUNCATE) acquire table-level locks.

Common concurrency patterns:

-- Atomic swap pattern for loading (no downtime)
BEGIN;
  CREATE TABLE orders_new (LIKE orders);
  COPY orders_new FROM 's3://...' ...;
  ALTER TABLE orders RENAME TO orders_old;
  ALTER TABLE orders_new RENAME TO orders;
COMMIT;
DROP TABLE orders_old;

Q12. What are late-binding views and materialized views in Redshift?

Late-binding views — don’t validate column existence at creation time. Queries succeed even if the underlying table changes, as long as the referenced columns exist at query time. Useful for Spectrum (external table) views.

CREATE VIEW active_users WITH NO SCHEMA BINDING AS
SELECT user_id, created_at FROM external_schema.users WHERE active = true;

Materialized views — pre-compute and cache query results. Redshift can refresh them automatically or on demand.

CREATE MATERIALIZED VIEW daily_revenue AS
SELECT DATE(order_ts) as date, SUM(amount) as revenue
FROM orders GROUP BY 1;

REFRESH MATERIALIZED VIEW daily_revenue;

Materialized views are especially powerful for BI tools that repeatedly run the same aggregate queries.

Modern Features (2024–2025)

Q13. What is Amazon Redshift’s Zero-ETL integration?

Zero-ETL (launched 2023, expanded 2024-2025) enables near-real-time data replication from Amazon Aurora, RDS, and DynamoDB directly into Redshift — without building or maintaining an ETL pipeline. Data is available for analytics within seconds of transactional commits.

Use cases: operational analytics, real-time dashboards on transactional data, eliminating CDC pipeline maintenance.

Q14. How does Redshift ML work?

Redshift ML integrates with Amazon SageMaker to train and deploy ML models directly from SQL:

-- Train a model
CREATE MODEL churn_predictor
FROM (SELECT feature1, feature2, label FROM training_data)
TARGET label
FUNCTION predict_churn
IAM_ROLE '...'
SETTINGS (S3_BUCKET 'my-bucket');

-- Inference in SQL
SELECT user_id, predict_churn(tenure, monthly_spend, support_tickets) as churn_prob
FROM customers;

This allows data teams to build and serve predictions without leaving the SQL environment.

Q15. What best practices would you follow when designing a Redshift schema?

Use RA3 nodes with managed storage for flexibility
Define distribution keys on the most common join column for large fact tables
Use ALL distribution for dimension tables under 3M rows
Define compound sort keys on the primary filter column (usually a date column)
Compress all columns (use ENCODE AUTO or ANALYZE COMPRESSION)
Partition Spectrum external tables on date columns
Use materialized views for frequently re-run aggregations
Enable automatic table optimization and Advisor recommendations
Load data via COPY with multiple parallel files
Schedule VACUUM during off-peak hours for heavily updated tables

Quick Reference: Key System Tables

Table	What it shows
`SVL_QLOG`	Query log summary
`STL_QUERY`	Full query history
`SVV_TABLE_INFO`	Table stats, skew, unsorted %
`STL_LOAD_ERRORS`	COPY command errors
`SVL_QUERY_REPORT`	Per-step execution metrics
`STV_BLOCKLIST`	Block-level storage layout
`SVV_REDSHIFT_TABLES`	RA3 managed storage info