Apache Spark Interview Questions and Answers

These questions cover Spark’s architecture, APIs, and performance — what’s tested for data engineer, ML engineer, and big data architect roles using Apache Spark.

Architecture & Core Concepts

Q1. What is Apache Spark and how does it differ from Hadoop MapReduce?

Spark is a distributed in-memory data processing engine for large-scale analytics, ML, and streaming. It provides unified APIs (Python, Scala, Java, R) for batch, streaming, SQL, and ML workloads.

Feature	Apache Spark	Hadoop MapReduce
Processing	In-memory (spills to disk when needed)	Disk-based after each step
Speed	Up to ~100× faster for in-memory iterative workloads; gains vary by job	Slow for multi-step jobs
API	RDD, DataFrame, Dataset, SQL, Streaming	Map + Reduce (verbose)
Fault tolerance	RDD lineage / DAG recomputation	Re-execution of failed tasks; intermediate output persisted on disk
Machine learning	MLlib integrated	Separate tools (Mahout)
Streaming	Structured Streaming (native)	None built in; needs a separate engine (e.g., Storm)

Spark remains a default choice for large-scale data engineering in 2025: Databricks, AWS EMR, and Google Dataproc all offer it as a managed service.

Q2. Explain Spark’s architecture: Driver, Executors, Cluster Manager.

Client
  └── Driver Program
        ├── SparkContext / SparkSession
        ├── DAG Scheduler → Stages → Task Sets
        └── Task Scheduler

Cluster Manager (YARN / Kubernetes / Standalone / Mesos)
  └── Allocates resources, monitors executors

Worker Nodes
  └── Executor (JVM process; one or more per worker node)
        ├── Task (runs on one partition)
        └── Block Manager (caches RDD/DataFrame partitions)

Driver: runs main(), creates the SparkSession, builds the DAG, and schedules tasks
Executor: runs tasks and stores cached data; a worker node can host one or more executors (configurable)
Task: processes one partition; tasks within a stage run in parallel
Stage: a set of tasks with no shuffle between them; stages are separated by shuffle boundaries
Job: triggered by an action (.collect(), .write.parquet()); may contain multiple stages
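
The relationship between jobs, stages, and shuffles can be illustrated with a toy model. This is a minimal sketch, not Spark's scheduler: it assumes each operation is already tagged as narrow (no shuffle) or wide (shuffle), and the function name `split_into_stages` and the operation list are hypothetical.

```python
from typing import List, Tuple

# Hypothetical tagging: True means the operation needs a shuffle (wide dependency).
Op = Tuple[str, bool]

def split_into_stages(ops: List[Op]) -> List[List[str]]:
    """Group operations into stages; every wide operation starts a new stage."""
    stages: List[List[str]] = [[]]
    for name, is_wide in ops:
        if is_wide and stages[-1]:
            stages.append([])  # shuffle boundary: close the current stage
        stages[-1].append(name)
    return stages

job = [
    ("read", False),
    ("filter", False),
    ("groupBy", True),   # shuffle
    ("orderBy", True),   # shuffle
]
stages = split_into_stages(job)
print(stages)
```

In this model the narrow operations are pipelined into one stage, and each shuffle adds another stage, which is why reducing the number of shuffles is usually the first target when tuning a job.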

Q3. What is the difference between an RDD, DataFrame, and Dataset?

from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("demo").getOrCreate()

# RDD: lowest-level API; no schema, so Catalyst cannot optimize it (in PySpark, elements are plain Python objects)
rdd = spark.sparkContext.parallelize([1, 2, 3, 4, 5])
doubled = rdd.map(lambda x: x * 2).filter(lambda x: x > 4)

# DataFrame — distributed table with named columns and types
# Uses Catalyst optimizer — FASTER than RDD for most queries
df = spark.read.parquet("s3://bucket/sales/")
result = df.filter("amount > 100").groupBy("region").agg({"amount": "sum"})

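// Dataset (Scala snippet): a typed DataFrame; Scala/Java only, PySpark has no Dataset API
// Type errors are caught at compile time
import org.apache.spark.sql.Dataset
import spark.implicits._  // supplies the Encoder that .as[Sale] needs
case class Sale(region: String, amount: Double)
val ds: Dataset[Sale] = spark.read.parquet("...").as[Sale]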

Recommendation: prefer DataFrames/Datasets over RDDs; the Catalyst optimizer and Tungsten execution engine usually make them significantly faster. Drop to RDDs only when you need low-level control over partitioning or record-level logic that the DataFrame API cannot express.

Spark SQL & DataFrames

Q4. Explain the Catalyst optimizer and how it improves query performance.

Catalyst is Spark’s query optimizer — it rewrites and optimizes your DataFrame transformations before execution:

Analysis — resolves column names and types against the schema
Logical optimization — rule-based: predicate pushdown, constant folding, projection pruning
Physical planning — generates multiple physical plans (e.g., BroadcastHashJoin vs SortMergeJoin)
Code generation — generates optimized Java bytecode (Tungsten whole-stage codegen)

# You write this:
df.filter("year = 2025").join(departments, "dept_id").select("name", "salary")

# Catalyst might rewrite to:
# - Push filter before join (read less data)
# - Prune columns (select only what's needed before join)
# - Choose BroadcastHashJoin if departments is small
# - Push filter down to Parquet reader (skip row groups)

The optimizer makes DataFrame queries faster than equivalent RDD code in almost all cases — even badly written SQL benefits from these rewrites.
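
To see why pushing a filter below a join helps, here is a toy comparison in plain Python. It is only a sketch of the idea, not Catalyst itself: the two functions, the row layout, and the sample data are all hypothetical, and "cost" is simply the number of rows that pass through the join.

```python
def join_then_filter(employees, departments, year):
    """Naive plan: join every employee row, then filter by year."""
    dept_by_id = {d["dept_id"]: d for d in departments}
    joined = [{**e, **dept_by_id[e["dept_id"]]}
              for e in employees if e["dept_id"] in dept_by_id]
    result = [row for row in joined if row["year"] == year]
    return result, len(joined)          # len(joined) = rows the join produced

def filter_then_join(employees, departments, year):
    """Rewritten plan: filter first, so the join sees fewer rows."""
    dept_by_id = {d["dept_id"]: d for d in departments}
    filtered = [e for e in employees if e["year"] == year]
    joined = [{**e, **dept_by_id[e["dept_id"]]}
              for e in filtered if e["dept_id"] in dept_by_id]
    return joined, len(joined)

employees = [{"name": f"e{i}", "dept_id": i % 3, "year": 2024 + (i % 2)}
             for i in range(10)]
departments = [{"dept_id": d, "dept_name": f"d{d}"} for d in range(3)]

naive_result, naive_rows = join_then_filter(employees, departments, 2025)
pushed_result, pushed_rows = filter_then_join(employees, departments, 2025)
print(naive_rows, pushed_rows)
```

Both plans return the same rows, but the rewritten one joins only the rows that survive the filter. Catalyst applies this kind of rewrite automatically, which you can confirm on a real query with `df.explain()`.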

Q5. How do you optimize a slow Spark job?

1. Check the Spark UI first — identify skewed stages, shuffle spills, OOM errors.

2. Partitioning:

# Too few partitions → underutilized cores
# Too many → scheduling overhead
# Rule of thumb: 2-4 tasks per CPU core in cluster
# Default shuffle partitions: 200 (often too many for small data, too few for large)
spark.conf.set("spark.sql.shuffle.partitions", "400")  # Tune to data size

# Repartition before expensive operations
df = df.repartition(400, "dept_id")  # Hash-partition by dept_id: equal keys land together, but a skewed key stays skewed

3. Handle data skew:

# Problem: one partition has 90% of the data
# Solution 1: Salting — add random key to break skew
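from pyspark.sql.functions import concat, col, lit, rand

# Append a random suffix 0-9 so rows of one hot key can hash to up to 10 partitions.
# The other side of the join must be replicated once per suffix so every salted key still matches.
df_salted = df.withColumn("salted_key",
    concat(col("skewed_key"), lit("_"), (rand() * 10).cast("int").cast("string")))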

# Solution 2: AQE (Adaptive Query Execution), available since Spark 3.0 and enabled by default since 3.2
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")
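
The effect of salting on partition sizes can be checked with a small stand-alone simulation. This is a sketch under stated assumptions: `zlib.crc32` stands in for Spark's hash partitioner, the salt is assigned round-robin instead of with `rand()` so the run is reproducible, and the key names and counts are made up.

```python
import zlib
from collections import Counter

NUM_PARTITIONS = 8
NUM_SALTS = 10

def partition_of(key: str) -> int:
    """Stable stand-in for a hash partitioner (not Spark's actual hash function)."""
    return zlib.crc32(key.encode("utf-8")) % NUM_PARTITIONS

# 90% of the rows share one hot key.
keys = ["hot"] * 900 + [f"k{i}" for i in range(100)]
before = Counter(partition_of(k) for k in keys)

# Round-robin salt for reproducibility; the Spark snippet uses rand() instead.
salted = [f"{k}_{i % NUM_SALTS}" for i, k in enumerate(keys)]
after = Counter(partition_of(k) for k in salted)

print("largest partition before salting:", max(before.values()))
print("largest partition after salting: ", max(after.values()))
```

Before salting, every "hot" row lands in the same partition; after salting, the hot key is split into ten sub-keys that hash independently, so no single task has to process all of it.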

4. Broadcast joins:

from pyspark.sql.functions import broadcast

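# If one side is below spark.sql.autoBroadcastJoinThreshold (default 10MB), Spark broadcasts it automatically
# The broadcast() hint forces a broadcast regardless of the threshold; the table must still fit in driver and executor memory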
result = large_df.join(broadcast(small_dim_df), "key")

spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "104857600")  # 100MB
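
What a broadcast hash join does can be sketched in plain Python. This is an illustrative model, not Spark's implementation: the function name `broadcast_hash_join`, the list-of-lists representation of partitions, and the sample rows are all hypothetical.

```python
def broadcast_hash_join(large_partitions, small_rows, key):
    """Inner-join each partition of the large side against a full copy of the small side."""
    # "Broadcast": build one lookup table from the small side and hand it to every task.
    lookup = {}
    for row in small_rows:
        lookup.setdefault(row[key], []).append(row)

    joined_partitions = []
    for partition in large_partitions:   # each partition is joined independently
        out = []
        for row in partition:
            for match in lookup.get(row[key], []):
                out.append({**row, **match})
        joined_partitions.append(out)
    return joined_partitions

large = [
    [{"key": 1, "amount": 10.0}, {"key": 2, "amount": 5.0}],
    [{"key": 1, "amount": 7.5}, {"key": 3, "amount": 1.0}],
]
small = [{"key": 1, "region": "EU"}, {"key": 2, "region": "US"}]

result = broadcast_hash_join(large, small, "key")
print(result)
```

The large side keeps its original partitioning and is never shuffled, which is where the speed-up over a sort-merge join comes from. The cost is that every executor must hold the whole small table in memory.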

5. Caching:

# Cache frequently reused DataFrames
lookup = spark.read.parquet("s3://bucket/dim/").cache()
lookup.count()  # Trigger materialization

# MEMORY_AND_DISK keeps what fits in memory and spills the rest to disk; use StorageLevel.DISK_ONLY to skip memory entirely
from pyspark import StorageLevel
df.persist(StorageLevel.MEMORY_AND_DISK)

6. File format: prefer a columnar format such as Parquet (compressed, with column pruning and predicate pushdown). Avoid CSV for large data.

Q6. What is lazy evaluation in Spark and what triggers execution?

Spark builds a DAG of transformations without executing them. Execution only happens when an action is called:

# These are lazy: they only build the plan (the read may still touch file metadata to resolve the schema)
df1 = spark.read.parquet("s3://sales/")
df2 = df1.filter("year >= 2023")
df3 = df2.groupBy("region").agg({"amount": "sum"})
df4 = df3.orderBy("sum(amount)", ascending=False)

# These are ACTIONS — trigger execution
df4.show()          # Runs all transformations
df4.collect()       # Returns all rows to driver (careful with large data!)
df4.count()         # Returns count
df4.write.parquet("s3://output/")  # Writes result
df4.first()         # Returns first row

Benefits of lazy evaluation:

Catalyst can see the full plan and optimize across all steps
Predicate pushdown: filters applied at read time, not after loading all data
No work is spent on transformations whose results no action ever requests
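
The transformation/action split can be mimicked in a few lines of plain Python. This is a toy sketch, not how Spark is implemented: the `LazyPipeline` class and its methods are hypothetical, and there is no optimizer, only deferred execution.

```python
class LazyPipeline:
    """Records transformations and runs them only when an action is called."""

    def __init__(self, source, steps=()):
        self._source = source
        self._steps = tuple(steps)

    def map(self, fn):
        # Transformation: returns a new plan, runs nothing.
        return LazyPipeline(self._source, self._steps + (("map", fn),))

    def filter(self, pred):
        # Transformation: returns a new plan, runs nothing.
        return LazyPipeline(self._source, self._steps + (("filter", pred),))

    def collect(self):
        # Action: execute every recorded step in order.
        rows = list(self._source)
        for kind, fn in self._steps:
            if kind == "map":
                rows = [fn(r) for r in rows]
            else:
                rows = [r for r in rows if fn(r)]
        return rows

calls = []

def double(x):
    calls.append(x)  # side effect lets us observe when work actually happens
    return x * 2

plan = LazyPipeline(range(5)).map(double).filter(lambda x: x > 4)
calls_before_action = len(calls)  # nothing has run yet
result = plan.collect()           # the action runs the recorded steps
print(calls_before_action, result)
```

Because the whole chain is known before anything runs, an engine like Spark has the chance to reorder or fuse steps. This toy version simply replays them.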

Structured Streaming

Q7. How does Spark Structured Streaming work?

Structured Streaming treats streaming data as an unbounded table — the same DataFrame/SQL API works for batch and streaming:

from pyspark.sql.functions import window, col, sum

# Read from Kafka
stream_df = (spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("subscribe", "orders")
  .load()
  .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING) AS json")
)

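from pyspark.sql.functions import from_json
from pyspark.sql.types import StructType, StringType, DoubleType, TimestampType

# The schema must include every field used downstream (region and event_time in the windowed aggregation)
schema = (StructType().add("order_id", StringType()).add("region", StringType())
  .add("amount", DoubleType()).add("event_time", TimestampType()))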
orders = stream_df.select(from_json(col("json"), schema).alias("data")).select("data.*")

# Windowed aggregation — sum per 5-minute window
windowed = (orders
  .withWatermark("event_time", "10 minutes")   # Handle late data
  .groupBy(window(col("event_time"), "5 minutes"), "region")
  .agg(sum("amount").alias("total_amount"))
)

# Write to sink
query = (windowed.writeStream
  .format("delta")            # Or parquet, kafka, memory, console
  .outputMode("append")       # append / complete / update
  .option("checkpointLocation", "s3://checkpoints/orders/")
  .trigger(processingTime="1 minute")  # Or .trigger(availableNow=True) to process the backlog once and stop
  .start()
)

query.awaitTermination()

Q8. What are the output modes in Structured Streaming?

Mode	Description	Use case
`append`	Only new rows (no updates)	Event logs, immutable streams
`complete`	Entire result table every trigger	Aggregations without watermark
`update`	Only rows that changed since last trigger	Stateful aggregations with watermark

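append requires a watermark for aggregations, so that Spark knows when a window is final and can emit it exactly once. complete keeps the entire result table as state and re-emits it on every trigger, which is dangerous for unbounded key spaces. update emits only the rows that changed, which suits sinks that can upsert; file sinks support only append.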
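
The interaction between the watermark and append mode is easier to see in a simplified model. The sketch below is an assumption-laden toy, not Spark's state store: event times are plain integers (minutes), the watermark advances once at the end of each micro-batch, and the function name `run_append_mode` and the sample batches are hypothetical.

```python
from collections import defaultdict

WINDOW = 5   # window length in minutes
DELAY = 10   # watermark delay in minutes

def run_append_mode(batches):
    """Return, per micro-batch, the windows finalized once the watermark passes them."""
    state = defaultdict(float)   # window start -> running sum
    max_event_time = 0
    watermark = 0
    emitted_per_batch = []
    for batch in batches:
        for event_time, amount in batch:
            start = (event_time // WINDOW) * WINDOW
            if start + WINDOW <= watermark:
                continue         # window already finalized: the late row is dropped
            state[start] += amount
            max_event_time = max(max_event_time, event_time)
        # The watermark trails the largest event time seen so far and never moves back.
        watermark = max(watermark, max_event_time - DELAY)
        done = sorted(s for s in state if s + WINDOW <= watermark)
        emitted_per_batch.append([(s, s + WINDOW, state.pop(s)) for s in done])
    return emitted_per_batch

batches = [
    [(1, 10.0), (3, 5.0)],     # both fall in window [0, 5)
    [(12, 2.0), (16, 1.0)],    # watermark moves to 6, so [0, 5) is final
    [(2, 100.0), (27, 4.0)],   # the event at t=2 is too late and is dropped
]
emitted = run_append_mode(batches)
print(emitted)
```

Each window appears in the output exactly once, and only after the watermark guarantees that no accepted row can still change it. In update mode the same windows would instead be emitted every time their sum changed.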

PySpark & ML

Q9. How does PySpark execute Python UDFs and what are the performance implications?

from pyspark.sql.functions import udf, col
from pyspark.sql.types import StringType

# Python UDF: rows are serialized between the JVM and a Python worker, and the function runs once per row (slow!)
@udf(returnType=StringType())
def clean_text(text):
    if text is None:
        return ""
    return text.strip().lower()

df.withColumn("clean", clean_text(col("raw_text")))  # Slow — serialization overhead

# Pandas UDF (vectorized) — processes batches, much faster
from pyspark.sql.functions import pandas_udf
import pandas as pd

@pandas_udf(StringType())
def clean_text_fast(series: pd.Series) -> pd.Series:
    return series.fillna("").str.strip().str.lower()

df.withColumn("clean", clean_text_fast(col("raw_text")))  # Usually much faster; Spark's docs cite gains of up to 100x

# Best: use built-in Spark SQL functions (no Python serialization at all)
from pyspark.sql.functions import lower, trim, coalesce, lit
df.withColumn("clean", lower(trim(coalesce(col("raw_text"), lit("")))))  # Fastest

Order of preference: Spark SQL built-ins > Pandas UDF (vectorized) > Python UDF.

Q10. How do you read and write partitioned data in Spark?

# Write partitioned by year and month
(df.write
  .format("parquet")
  .mode("overwrite")
  .partitionBy("year", "month")
  .option("compression", "snappy")
  .save("s3://bucket/sales/")
)

# Resulting layout:
# s3://bucket/sales/year=2025/month=1/part-00000.parquet
# s3://bucket/sales/year=2025/month=2/part-00000.parquet

# Read — Spark prunes partitions automatically when filtering
df = spark.read.parquet("s3://bucket/sales/")
# This only reads year=2025/month=3/ — partition pruning!
q1 = df.filter("year = 2025 AND month = 3")

# Delta Lake — preferred over raw Parquet for production
(df.write
  .format("delta")
  .mode("append")                                    # Or overwrite
  .option("mergeSchema", "true")                     # Schema evolution
  .partitionBy("year", "month")
  .save("s3://bucket/delta/sales/")
)

# Delta MERGE (upsert)
from delta.tables import DeltaTable

target = DeltaTable.forPath(spark, "s3://bucket/delta/sales/")
target.alias("t").merge(
  updates_df.alias("u"),
  "t.order_id = u.order_id"
).whenMatchedUpdateAll().whenNotMatchedInsertAll().execute()
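
Partition pruning on a Hive-style layout boils down to matching directory names before any file is opened. The following plain-Python sketch illustrates that idea only. The helper names `parse_partition` and `prune` and the file list are hypothetical, and Spark does this inside its file index rather than with string matching like this.

```python
def parse_partition(path: str) -> dict:
    """Extract key=value directory segments from a Hive-style partition path."""
    values = {}
    for segment in path.split("/"):
        if "=" in segment:
            key, _, value = segment.partition("=")
            values[key] = value
    return values

def prune(paths, **wanted):
    """Keep only the files whose partition values match every requested value."""
    return [
        p for p in paths
        if all(parse_partition(p).get(k) == str(v) for k, v in wanted.items())
    ]

files = [
    "s3://bucket/sales/year=2024/month=12/part-00000.parquet",
    "s3://bucket/sales/year=2025/month=1/part-00000.parquet",
    "s3://bucket/sales/year=2025/month=3/part-00000.parquet",
]
selected = prune(files, year=2025, month=3)
print(selected)
```

Pruning only helps when queries filter on the partition columns, so choose them to match the most common filters and keep their cardinality low enough to avoid a flood of tiny files.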

Q11. What is Delta Lake and why is it preferred over raw Parquet for production pipelines?

Delta Lake is an open-source storage layer that adds ACID transactions, schema enforcement, and time travel to data lake storage:

Feature	Raw Parquet	Delta Lake
ACID transactions	No	Yes
Schema enforcement	Manual	Enforced on write
Time travel	No	Yes (version history)
Upsert (MERGE)	Painful rewrite	Native
Streaming + batch	Separate	Unified
Compaction	Manual	`OPTIMIZE` command

# Time travel: read the table as of a timestamp or a version
path = "s3://bucket/delta/sales/"  # the Delta table written in Q10
df_yesterday = spark.read.format("delta").option("timestampAsOf", "2025-06-15").load(path)
df_version5  = spark.read.format("delta").option("versionAsOf", "5").load(path)

# Audit history
from delta.tables import DeltaTable
DeltaTable.forPath(spark, path).history(10).show()

Delta Lake (created at Databricks, now an open-source Linux Foundation project) has become a standard table format for production data lakehouses, alongside Apache Iceberg.
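
Time travel follows from keeping an ordered commit log of which data files were added and removed. The sketch below is a heavily simplified model of that idea, not Delta's actual log format: the `files_at_version` function, the dictionary layout, and the file names are hypothetical.

```python
def files_at_version(log, version):
    """Replay commits 0..version and return the set of live data files."""
    live = set()
    for commit in log[: version + 1]:
        live -= set(commit.get("remove", []))
        live |= set(commit.get("add", []))
    return live

log = [
    {"add": ["part-0.parquet"]},                                # version 0: initial write
    {"add": ["part-1.parquet"]},                                # version 1: append
    {"add": ["part-2.parquet"], "remove": ["part-0.parquet"]},  # version 2: rewrite
]

print(sorted(files_at_version(log, 1)))
print(sorted(files_at_version(log, 2)))
```

Reading an older version just means replaying fewer commits. In this model, old files must stay on storage until they are cleaned up, which is why time travel only reaches back as far as the retained history.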