Interviews

🎯 Interview Guides 12 guides · updated 2026

Real questions and structured answers for data, cloud, and AI engineering interviews β€” including the system-design and GenAI rounds now showing up everywhere.

Apache Spark Interview Questions and Answers

These questions cover Spark’s architecture, APIs, and performance β€” what’s tested for data engineer, ML engineer, and big data architect roles using Apache Spark.


Architecture & Core Concepts

Q1. What is Apache Spark and how does it differ from Hadoop MapReduce?

Spark is a distributed in-memory data processing engine for large-scale analytics, ML, and streaming. It provides unified APIs (Python, Scala, Java, R) for batch, streaming, SQL, and ML workloads.

FeatureApache SparkHadoop MapReduce
ProcessingIn-memory (spills to disk when needed)Disk-based after each step
Speed10–100Γ— faster for iterative algorithmsSlow for multi-step jobs
APIRDD, DataFrame, Dataset, SQL, StreamingMap + Reduce (verbose)
Fault toleranceRDD lineage / DAG recomputationJob restarts
Machine learningMLlib integratedSeparate tools (Mahout)
StreamingStructured Streaming (native)Storm integration

Spark is the dominant choice for data engineering in 2025 β€” most cloud platforms (Databricks, AWS EMR, Google Dataproc) run Spark under the hood.


Q2. Explain Spark’s architecture: Driver, Executors, Cluster Manager.

Client
└── Driver Program
β”œβ”€β”€ SparkContext / SparkSession
β”œβ”€β”€ DAG Scheduler β†’ Physical Plan β†’ Task Sets
└── Task Scheduler
Cluster Manager (YARN / Kubernetes / Standalone / Mesos)
└── Allocates resources, monitors executors
Worker Nodes
└── Executor (JVM process per worker)
β”œβ”€β”€ Task (runs on one partition)
└── Block Manager (caches RDD/DataFrame partitions)

Q3. What is the difference between an RDD, DataFrame, and Dataset?

from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("demo").getOrCreate()
# RDD β€” lowest level, untyped, JVM objects
rdd = spark.sparkContext.parallelize([1, 2, 3, 4, 5])
doubled = rdd.map(lambda x: x * 2).filter(lambda x: x > 4)
# DataFrame β€” distributed table with named columns and types
# Uses Catalyst optimizer β€” FASTER than RDD for most queries
df = spark.read.parquet("s3://bucket/sales/")
result = df.filter("amount > 100").groupBy("region").agg({"amount": "sum"})
# Dataset β€” type-safe DataFrame (Scala/Java only; Python has no Dataset)
# Catches type errors at compile time in Scala
case class Sale(region: String, amount: Double)
val ds: Dataset[Sale] = spark.read.parquet("...").as[Sale]

Recommendation: always use DataFrames/Datasets over RDDs β€” Catalyst optimizer and Tungsten execution engine make them significantly faster. Only drop to RDD when you need fine-grained control over partitioning or need Python functions that DataFrames can’t express.


Spark SQL & DataFrames

Q4. Explain the Catalyst optimizer and how it improves query performance.

Catalyst is Spark’s query optimizer β€” it rewrites and optimizes your DataFrame transformations before execution:

  1. Analysis β€” resolves column names and types against the schema
  2. Logical optimization β€” rule-based: predicate pushdown, constant folding, projection pruning
  3. Physical planning β€” generates multiple physical plans (e.g., BroadcastHashJoin vs SortMergeJoin)
  4. Code generation β€” generates optimized Java bytecode (Tungsten whole-stage codegen)
# You write this:
df.filter("year = 2025").join(departments, "dept_id").select("name", "salary")
# Catalyst might rewrite to:
# - Push filter before join (read less data)
# - Prune columns (select only what's needed before join)
# - Choose BroadcastHashJoin if departments is small
# - Push filter down to Parquet reader (skip row groups)

The optimizer makes DataFrame queries faster than equivalent RDD code in almost all cases β€” even badly written SQL benefits from these rewrites.


Q5. How do you optimize a slow Spark job?

1. Check the Spark UI first β€” identify skewed stages, shuffle spills, OOM errors.

2. Partitioning:

# Too few partitions β†’ underutilized cores
# Too many β†’ scheduling overhead
# Rule of thumb: 2-4 tasks per CPU core in cluster
# Default shuffle partitions: 200 (often too many for small data, too few for large)
spark.conf.set("spark.sql.shuffle.partitions", "400") # Tune to data size
# Repartition before expensive operations
df = df.repartition(400, "dept_id") # Evenly distribute by dept_id

3. Handle data skew:

# Problem: one partition has 90% of the data
# Solution 1: Salting β€” add random key to break skew
from pyspark.sql.functions import concat, col, lit, rand
import math
df_salted = df.withColumn("salted_key",
concat(col("skewed_key"), lit("_"), (rand() * 10).cast("int")))
# Solution 2: AQE (Adaptive Query Execution) β€” enabled by default in Spark 3.0+
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")

4. Broadcast joins:

from pyspark.sql.functions import broadcast
# If one side < 10MB (default threshold), Spark auto-broadcasts
# Force broadcast for up to ~100MB
result = large_df.join(broadcast(small_dim_df), "key")
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "104857600") # 100MB

5. Caching:

# Cache frequently reused DataFrames
lookup = spark.read.parquet("s3://bucket/dim/").cache()
lookup.count() # Trigger materialization
# Use DISK_ONLY for data larger than memory
from pyspark import StorageLevel
df.persist(StorageLevel.MEMORY_AND_DISK)

6. File format: Always use Parquet (columnar, compressed, predicate pushdown). Avoid CSV for large data.


Q6. What is lazy evaluation in Spark and what triggers execution?

Spark builds a DAG of transformations without executing them. Execution only happens when an action is called:

# These are TRANSFORMATIONS β€” build the plan, nothing runs
df1 = spark.read.parquet("s3://sales/")
df2 = df1.filter("year >= 2023")
df3 = df2.groupBy("region").agg({"amount": "sum"})
df4 = df3.orderBy("sum(amount)", ascending=False)
# These are ACTIONS β€” trigger execution
df4.show() # Runs all transformations
df4.collect() # Returns all rows to driver (careful with large data!)
df4.count() # Returns count
df4.write.parquet("s3://output/") # Writes result
df4.first() # Returns first row

Benefits of lazy evaluation:


Structured Streaming

Q7. How does Spark Structured Streaming work?

Structured Streaming treats streaming data as an unbounded table β€” the same DataFrame/SQL API works for batch and streaming:

from pyspark.sql.functions import window, col, sum
# Read from Kafka
stream_df = (spark.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "broker:9092")
.option("subscribe", "orders")
.load()
.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING) AS json")
)
from pyspark.sql.functions import from_json, schema_of_json
from pyspark.sql.types import StructType, StringType, DoubleType
schema = StructType().add("order_id", StringType()).add("amount", DoubleType())
orders = stream_df.select(from_json(col("json"), schema).alias("data")).select("data.*")
# Windowed aggregation β€” sum per 5-minute window
windowed = (orders
.withWatermark("event_time", "10 minutes") # Handle late data
.groupBy(window(col("event_time"), "5 minutes"), "region")
.agg(sum("amount").alias("total_amount"))
)
# Write to sink
query = (windowed.writeStream
.format("delta") # Or parquet, kafka, memory, console
.outputMode("append") # append / complete / update
.option("checkpointLocation", "s3://checkpoints/orders/")
.trigger(processingTime="1 minute") # Or Trigger.AvailableNow() for micro-batch
.start()
)
query.awaitTermination()

Q8. What are the output modes in Structured Streaming?

ModeDescriptionUse case
appendOnly new rows (no updates)Event logs, immutable streams
completeEntire result table every triggerAggregations without watermark
updateOnly rows that changed since last triggerStateful aggregations with watermark

append requires a watermark for aggregations (so Spark knows when a window is final). complete retains all state in memory β€” dangerous for unbounded data. update is most efficient for stateful operations with streaming sinks.


PySpark & ML

Q9. How does PySpark execute Python UDFs and what are the performance implications?

from pyspark.sql.functions import udf
from pyspark.sql.types import StringType
# Python UDF β€” data crosses JVM↔Python boundary per row (slow!)
@udf(returnType=StringType())
def clean_text(text):
if text is None:
return ""
return text.strip().lower()
df.withColumn("clean", clean_text(col("raw_text"))) # Slow β€” serialization overhead
# Pandas UDF (vectorized) β€” processes batches, much faster
from pyspark.sql.functions import pandas_udf
import pandas as pd
@pandas_udf(StringType())
def clean_text_fast(series: pd.Series) -> pd.Series:
return series.fillna("").str.strip().str.lower()
df.withColumn("clean", clean_text_fast(col("raw_text"))) # 10-100x faster
# Best: use built-in Spark SQL functions (no Python serialization at all)
from pyspark.sql.functions import lower, trim, coalesce, lit
df.withColumn("clean", lower(trim(coalesce(col("raw_text"), lit(""))))) # Fastest

Order of preference: Spark SQL built-ins > Pandas UDF (vectorized) > Python UDF.


Q10. How do you read and write partitioned data in Spark?

# Write partitioned by year and month
(df.write
.format("parquet")
.mode("overwrite")
.partitionBy("year", "month")
.option("compression", "snappy")
.save("s3://bucket/sales/")
)
# Resulting layout:
# s3://bucket/sales/year=2025/month=1/part-00000.parquet
# s3://bucket/sales/year=2025/month=2/part-00000.parquet
# Read β€” Spark prunes partitions automatically when filtering
df = spark.read.parquet("s3://bucket/sales/")
# This only reads year=2025/month=3/ β€” partition pruning!
q1 = df.filter("year = 2025 AND month = 3")
# Delta Lake β€” preferred over raw Parquet for production
(df.write
.format("delta")
.mode("append") # Or overwrite
.option("mergeSchema", "true") # Schema evolution
.partitionBy("year", "month")
.save("s3://bucket/delta/sales/")
)
# Delta MERGE (upsert)
from delta.tables import DeltaTable
target = DeltaTable.forPath(spark, "s3://bucket/delta/sales/")
target.alias("t").merge(
updates_df.alias("u"),
"t.order_id = u.order_id"
).whenMatchedUpdateAll().whenNotMatchedInsertAll().execute()

Q11. What is Delta Lake and why is it preferred over raw Parquet for production pipelines?

Delta Lake is an open-source storage layer that adds ACID transactions, schema enforcement, and time travel to data lake storage:

FeatureRaw ParquetDelta Lake
ACID transactionsNoYes
Schema enforcementManualEnforced on write
Time travelNoYes (version history)
Upsert (MERGE)Painful rewriteNative
Streaming + batchSeparateUnified
CompactionManualOPTIMIZE command
# Time travel β€” read data as it was at a point in time
df_yesterday = spark.read.format("delta").option("timestampAsOf", "2025-06-15").load(path)
df_version5 = spark.read.format("delta").option("versionAsOf", "5").load(path)
# Audit history
from delta.tables import DeltaTable
DeltaTable.forPath(spark, path).history(10).show()

Delta Lake (from Databricks, now open-source and part of The Linux Foundation) is now a standard for production data lakehouses alongside Apache Iceberg.