Interviews

🎯 Interview Guides 12 guides · updated 2026

Real questions and structured answers for data, cloud, and AI engineering interviews — including the system-design and GenAI rounds now showing up everywhere.

AWS Redshift Interview Questions and Answers

Amazon Redshift is the most widely used cloud data warehouse. These questions cover architecture, performance, SQL optimization, and modern integrations — the topics that come up most in data engineering and analytics engineering interviews.


Fundamentals

Q1. What is Amazon Redshift and how does it differ from a traditional RDBMS?

Amazon Redshift is a fully managed, petabyte-scale cloud data warehouse optimized for analytical workloads (OLAP). Key differences from transactional databases:


Q2. Explain Redshift’s node architecture.

A Redshift cluster has:

RA3 is the recommended type for most workloads as of 2025 because you pay for compute and storage separately.


Q3. What is a distribution style and why does it matter?

Distribution style determines how rows are physically distributed across slices, directly affecting join and aggregation performance:

StyleHow it worksBest for
KEYRows with same key value go to same sliceLarge fact-dimension joins on that key
ALLFull copy of table on every nodeSmall dimension tables (< 3M rows)
EVENRound-robin across slicesNo dominant join key, balanced scans
AUTORedshift decides based on table sizeDefault; good starting point

Choosing KEY distribution for frequently joined large tables eliminates data redistribution (data movement) during query execution, which is one of the biggest performance wins.


Q4. What is a sort key and what are the different types?

Sort keys determine the physical order of data on disk. Redshift uses zone maps (min/max metadata per block) to skip blocks that can’t contain query-relevant rows.

Compound sort key — defines a hierarchy of columns. Effective only if queries filter on the leading column. Example: COMPOUND SORTKEY (event_date, user_id) — queries filtering on event_date benefit; queries filtering only on user_id don’t.

Interleaved sort key — weighs each column equally; effective for queries filtering on any combination of columns. Higher maintenance overhead (VACUUM is more expensive).

Best practice 2025: use compound sort keys with the most common filter column first. RA3 nodes with automatic table optimization (ATO) can recommend sort/distribution keys based on actual query patterns.


Q5. What is VACUUM in Redshift and when should you run it?

Redshift uses MVCC — deleted rows aren’t immediately removed; they’re marked as deleted but still on disk. VACUUM reclaims this space and re-sorts unsorted rows in the sort key order.

VACUUM FULL my_table; -- Reclaim space AND resort
VACUUM SORT ONLY my_table; -- Re-sort only (no reclaim)
VACUUM DELETE ONLY my_table; -- Reclaim space only (faster)
VACUUM REINDEX my_table; -- For interleaved sort keys

With RA3 nodes, Redshift runs automatic vacuuming in the background. Manual VACUUM is still useful after large bulk deletes or when performance degrades noticeably.


Performance Tuning

Q6. How do you diagnose a slow Redshift query?

Step-by-step approach:

-- 1. Find the slow query
SELECT query, elapsed, rows, label
FROM stl_query
WHERE userid > 1
ORDER BY elapsed DESC
LIMIT 10;
-- 2. Check the execution plan
EXPLAIN SELECT ...;
-- 3. Look for data movement (DS_BCAST_INNER, DS_DIST_ALL_NONE are expensive)
-- Check SVL_QUERY_REPORT for actual execution metrics
SELECT query, step, rows, bytes, label
FROM svl_query_report
WHERE query = <query_id>
ORDER BY step;
-- 4. Check for table skew (uneven data distribution)
SELECT slice, COUNT(*) FROM my_table GROUP BY slice ORDER BY slice;

Key warning signs in EXPLAIN output:


Q7. What is Redshift Spectrum and when would you use it?

Redshift Spectrum extends query capability directly to data in Amazon S3 — without loading it into Redshift cluster storage. It uses external tables defined in the AWS Glue Data Catalog.

-- Create an external schema pointing to Glue catalog
CREATE EXTERNAL SCHEMA spectrum_schema
FROM DATA CATALOG
DATABASE 'my_glue_db'
IAM_ROLE 'arn:aws:iam::123456789:role/RedshiftSpectrumRole'
CREATE EXTERNAL DATABASE IF NOT EXISTS;
-- Query S3 data directly
SELECT s.event_date, COUNT(*) as events
FROM spectrum_schema.raw_events s
JOIN my_redshift_table r ON s.user_id = r.user_id
WHERE s.event_date >= '2025-01-01'
GROUP BY 1;

Use Spectrum for:


Q8. Explain WLM (Workload Management) in Redshift.

WLM controls how queries are queued and allocated memory/concurrency across different user groups or query types.

Automatic WLM (recommended) — Redshift allocates memory and concurrency automatically based on query complexity and system load. Default since 2019.

Manual WLM — you define up to 8 queues with explicit concurrency levels (number of simultaneous queries) and memory percentages.

-- Check queue assignment
SELECT query, queue_start_time, queue_end_time, exec_start_time
FROM stl_wlm_query
WHERE query = <id>;

Best practice: enable automatic WLM with concurrency scaling (burst capacity from additional clusters on-demand) for variable workloads.


Q9. What is the COPY command and why is it preferred for bulk loading?

COPY is Redshift’s bulk load mechanism — it reads in parallel from multiple S3 files across all compute node slices simultaneously.

COPY sales
FROM 's3://my-bucket/data/sales/'
IAM_ROLE 'arn:aws:iam::123456789:role/RedshiftRole'
FORMAT AS PARQUET;
-- With options
COPY orders
FROM 's3://my-bucket/orders/'
IAM_ROLE '...'
CSV
DELIMITER ','
IGNOREHEADER 1
DATEFORMAT 'YYYY-MM-DD'
MAXERROR 10;

Preferred over INSERT because:

Best practice: split source files so there’s at least one file per node slice (e.g., 16 slices → at least 16 files).


Q10. What is Redshift Serverless and how does it differ from provisioned clusters?

Redshift Serverless (GA 2022) automatically provisions and scales compute capacity. You don’t manage nodes — you set a maximum RPU (Redshift Processing Units) limit.

FeatureProvisionedServerless
BillingPer node-hourPer RPU-second (while active)
ScalingManual or scheduledAutomatic
Concurrency scalingOptional add-onBuilt-in
Best forPredictable, continuous workloadsSporadic or unpredictable usage
Cold startNoneSeconds for first query

Serverless is often more cost-effective for development environments, ad-hoc analytics teams, and workloads with large idle periods.


SQL & Data Modeling

Q11. How does Redshift handle transactions and concurrency?

Redshift uses serializable isolation with MVCC (Multi-Version Concurrency Control). Reads don’t block writes and vice versa. However, Redshift does not support row-level locking — DDL statements (ALTER TABLE, TRUNCATE) acquire table-level locks.

Common concurrency patterns:

-- Atomic swap pattern for loading (no downtime)
BEGIN;
CREATE TABLE orders_new (LIKE orders);
COPY orders_new FROM 's3://...' ...;
ALTER TABLE orders RENAME TO orders_old;
ALTER TABLE orders_new RENAME TO orders;
COMMIT;
DROP TABLE orders_old;

Q12. What are late-binding views and materialized views in Redshift?

Late-binding views — don’t validate column existence at creation time. Queries succeed even if the underlying table changes, as long as the referenced columns exist at query time. Useful for Spectrum (external table) views.

CREATE VIEW active_users WITH NO SCHEMA BINDING AS
SELECT user_id, created_at FROM external_schema.users WHERE active = true;

Materialized views — pre-compute and cache query results. Redshift can refresh them automatically or on demand.

CREATE MATERIALIZED VIEW daily_revenue AS
SELECT DATE(order_ts) as date, SUM(amount) as revenue
FROM orders GROUP BY 1;
REFRESH MATERIALIZED VIEW daily_revenue;

Materialized views are especially powerful for BI tools that repeatedly run the same aggregate queries.


Modern Features (2024–2025)

Q13. What is Amazon Redshift’s Zero-ETL integration?

Zero-ETL (launched 2023, expanded 2024-2025) enables near-real-time data replication from Amazon Aurora, RDS, and DynamoDB directly into Redshift — without building or maintaining an ETL pipeline. Data is available for analytics within seconds of transactional commits.

Use cases: operational analytics, real-time dashboards on transactional data, eliminating CDC pipeline maintenance.


Q14. How does Redshift ML work?

Redshift ML integrates with Amazon SageMaker to train and deploy ML models directly from SQL:

-- Train a model
CREATE MODEL churn_predictor
FROM (SELECT feature1, feature2, label FROM training_data)
TARGET label
FUNCTION predict_churn
IAM_ROLE '...'
SETTINGS (S3_BUCKET 'my-bucket');
-- Inference in SQL
SELECT user_id, predict_churn(tenure, monthly_spend, support_tickets) as churn_prob
FROM customers;

This allows data teams to build and serve predictions without leaving the SQL environment.


Q15. What best practices would you follow when designing a Redshift schema?


Quick Reference: Key System Tables

TableWhat it shows
SVL_QLOGQuery log summary
STL_QUERYFull query history
SVV_TABLE_INFOTable stats, skew, unsorted %
STL_LOAD_ERRORSCOPY command errors
SVL_QUERY_REPORTPer-step execution metrics
STV_BLOCKLISTBlock-level storage layout
SVV_REDSHIFT_TABLESRA3 managed storage info