Apache Spark Interview Questions and Answers
These questions cover Sparkβs architecture, APIs, and performance β whatβs tested for data engineer, ML engineer, and big data architect roles using Apache Spark.
Architecture & Core Concepts
Q1. What is Apache Spark and how does it differ from Hadoop MapReduce?
Spark is a distributed in-memory data processing engine for large-scale analytics, ML, and streaming. It provides unified APIs (Python, Scala, Java, R) for batch, streaming, SQL, and ML workloads.
| Feature | Apache Spark | Hadoop MapReduce |
|---|---|---|
| Processing | In-memory (spills to disk when needed) | Disk-based after each step |
| Speed | 10β100Γ faster for iterative algorithms | Slow for multi-step jobs |
| API | RDD, DataFrame, Dataset, SQL, Streaming | Map + Reduce (verbose) |
| Fault tolerance | RDD lineage / DAG recomputation | Job restarts |
| Machine learning | MLlib integrated | Separate tools (Mahout) |
| Streaming | Structured Streaming (native) | Storm integration |
Spark is the dominant choice for data engineering in 2025 β most cloud platforms (Databricks, AWS EMR, Google Dataproc) run Spark under the hood.
Q2. Explain Sparkβs architecture: Driver, Executors, Cluster Manager.
Client βββ Driver Program βββ SparkContext / SparkSession βββ DAG Scheduler β Physical Plan β Task Sets βββ Task Scheduler
Cluster Manager (YARN / Kubernetes / Standalone / Mesos) βββ Allocates resources, monitors executors
Worker Nodes βββ Executor (JVM process per worker) βββ Task (runs on one partition) βββ Block Manager (caches RDD/DataFrame partitions)- Driver β runs
main(), createsSparkSession, builds the DAG, schedules tasks - Executor β runs tasks, stores cached data; one executor per node (configurable)
- Task β processes one partition; tasks within a stage run in parallel
- Stage β set of tasks separated by shuffle boundaries
- Job β triggered by an action (
.collect(),.write()); may contain multiple stages
Q3. What is the difference between an RDD, DataFrame, and Dataset?
from pyspark.sql import SparkSessionspark = SparkSession.builder.appName("demo").getOrCreate()
# RDD β lowest level, untyped, JVM objectsrdd = spark.sparkContext.parallelize([1, 2, 3, 4, 5])doubled = rdd.map(lambda x: x * 2).filter(lambda x: x > 4)
# DataFrame β distributed table with named columns and types# Uses Catalyst optimizer β FASTER than RDD for most queriesdf = spark.read.parquet("s3://bucket/sales/")result = df.filter("amount > 100").groupBy("region").agg({"amount": "sum"})
# Dataset β type-safe DataFrame (Scala/Java only; Python has no Dataset)# Catches type errors at compile time in Scalacase class Sale(region: String, amount: Double)val ds: Dataset[Sale] = spark.read.parquet("...").as[Sale]Recommendation: always use DataFrames/Datasets over RDDs β Catalyst optimizer and Tungsten execution engine make them significantly faster. Only drop to RDD when you need fine-grained control over partitioning or need Python functions that DataFrames canβt express.
Spark SQL & DataFrames
Q4. Explain the Catalyst optimizer and how it improves query performance.
Catalyst is Sparkβs query optimizer β it rewrites and optimizes your DataFrame transformations before execution:
- Analysis β resolves column names and types against the schema
- Logical optimization β rule-based: predicate pushdown, constant folding, projection pruning
- Physical planning β generates multiple physical plans (e.g., BroadcastHashJoin vs SortMergeJoin)
- Code generation β generates optimized Java bytecode (Tungsten whole-stage codegen)
# You write this:df.filter("year = 2025").join(departments, "dept_id").select("name", "salary")
# Catalyst might rewrite to:# - Push filter before join (read less data)# - Prune columns (select only what's needed before join)# - Choose BroadcastHashJoin if departments is small# - Push filter down to Parquet reader (skip row groups)The optimizer makes DataFrame queries faster than equivalent RDD code in almost all cases β even badly written SQL benefits from these rewrites.
Q5. How do you optimize a slow Spark job?
1. Check the Spark UI first β identify skewed stages, shuffle spills, OOM errors.
2. Partitioning:
# Too few partitions β underutilized cores# Too many β scheduling overhead# Rule of thumb: 2-4 tasks per CPU core in cluster# Default shuffle partitions: 200 (often too many for small data, too few for large)spark.conf.set("spark.sql.shuffle.partitions", "400") # Tune to data size
# Repartition before expensive operationsdf = df.repartition(400, "dept_id") # Evenly distribute by dept_id3. Handle data skew:
# Problem: one partition has 90% of the data# Solution 1: Salting β add random key to break skewfrom pyspark.sql.functions import concat, col, lit, randimport math
df_salted = df.withColumn("salted_key", concat(col("skewed_key"), lit("_"), (rand() * 10).cast("int")))
# Solution 2: AQE (Adaptive Query Execution) β enabled by default in Spark 3.0+spark.conf.set("spark.sql.adaptive.enabled", "true")spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")4. Broadcast joins:
from pyspark.sql.functions import broadcast
# If one side < 10MB (default threshold), Spark auto-broadcasts# Force broadcast for up to ~100MBresult = large_df.join(broadcast(small_dim_df), "key")
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "104857600") # 100MB5. Caching:
# Cache frequently reused DataFrameslookup = spark.read.parquet("s3://bucket/dim/").cache()lookup.count() # Trigger materialization
# Use DISK_ONLY for data larger than memoryfrom pyspark import StorageLeveldf.persist(StorageLevel.MEMORY_AND_DISK)6. File format: Always use Parquet (columnar, compressed, predicate pushdown). Avoid CSV for large data.
Q6. What is lazy evaluation in Spark and what triggers execution?
Spark builds a DAG of transformations without executing them. Execution only happens when an action is called:
# These are TRANSFORMATIONS β build the plan, nothing runsdf1 = spark.read.parquet("s3://sales/")df2 = df1.filter("year >= 2023")df3 = df2.groupBy("region").agg({"amount": "sum"})df4 = df3.orderBy("sum(amount)", ascending=False)
# These are ACTIONS β trigger executiondf4.show() # Runs all transformationsdf4.collect() # Returns all rows to driver (careful with large data!)df4.count() # Returns countdf4.write.parquet("s3://output/") # Writes resultdf4.first() # Returns first rowBenefits of lazy evaluation:
- Catalyst can see the full plan and optimize across all steps
- Predicate pushdown: filters applied at read time, not after loading all data
- No wasted computation if the job is cancelled early
Structured Streaming
Q7. How does Spark Structured Streaming work?
Structured Streaming treats streaming data as an unbounded table β the same DataFrame/SQL API works for batch and streaming:
from pyspark.sql.functions import window, col, sum
# Read from Kafkastream_df = (spark.readStream .format("kafka") .option("kafka.bootstrap.servers", "broker:9092") .option("subscribe", "orders") .load() .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING) AS json"))
from pyspark.sql.functions import from_json, schema_of_jsonfrom pyspark.sql.types import StructType, StringType, DoubleType
schema = StructType().add("order_id", StringType()).add("amount", DoubleType())orders = stream_df.select(from_json(col("json"), schema).alias("data")).select("data.*")
# Windowed aggregation β sum per 5-minute windowwindowed = (orders .withWatermark("event_time", "10 minutes") # Handle late data .groupBy(window(col("event_time"), "5 minutes"), "region") .agg(sum("amount").alias("total_amount")))
# Write to sinkquery = (windowed.writeStream .format("delta") # Or parquet, kafka, memory, console .outputMode("append") # append / complete / update .option("checkpointLocation", "s3://checkpoints/orders/") .trigger(processingTime="1 minute") # Or Trigger.AvailableNow() for micro-batch .start())
query.awaitTermination()Q8. What are the output modes in Structured Streaming?
| Mode | Description | Use case |
|---|---|---|
append | Only new rows (no updates) | Event logs, immutable streams |
complete | Entire result table every trigger | Aggregations without watermark |
update | Only rows that changed since last trigger | Stateful aggregations with watermark |
append requires a watermark for aggregations (so Spark knows when a window is final).
complete retains all state in memory β dangerous for unbounded data.
update is most efficient for stateful operations with streaming sinks.
PySpark & ML
Q9. How does PySpark execute Python UDFs and what are the performance implications?
from pyspark.sql.functions import udffrom pyspark.sql.types import StringType
# Python UDF β data crosses JVMβPython boundary per row (slow!)@udf(returnType=StringType())def clean_text(text): if text is None: return "" return text.strip().lower()
df.withColumn("clean", clean_text(col("raw_text"))) # Slow β serialization overhead
# Pandas UDF (vectorized) β processes batches, much fasterfrom pyspark.sql.functions import pandas_udfimport pandas as pd
@pandas_udf(StringType())def clean_text_fast(series: pd.Series) -> pd.Series: return series.fillna("").str.strip().str.lower()
df.withColumn("clean", clean_text_fast(col("raw_text"))) # 10-100x faster
# Best: use built-in Spark SQL functions (no Python serialization at all)from pyspark.sql.functions import lower, trim, coalesce, litdf.withColumn("clean", lower(trim(coalesce(col("raw_text"), lit(""))))) # FastestOrder of preference: Spark SQL built-ins > Pandas UDF (vectorized) > Python UDF.
Q10. How do you read and write partitioned data in Spark?
# Write partitioned by year and month(df.write .format("parquet") .mode("overwrite") .partitionBy("year", "month") .option("compression", "snappy") .save("s3://bucket/sales/"))
# Resulting layout:# s3://bucket/sales/year=2025/month=1/part-00000.parquet# s3://bucket/sales/year=2025/month=2/part-00000.parquet
# Read β Spark prunes partitions automatically when filteringdf = spark.read.parquet("s3://bucket/sales/")# This only reads year=2025/month=3/ β partition pruning!q1 = df.filter("year = 2025 AND month = 3")
# Delta Lake β preferred over raw Parquet for production(df.write .format("delta") .mode("append") # Or overwrite .option("mergeSchema", "true") # Schema evolution .partitionBy("year", "month") .save("s3://bucket/delta/sales/"))
# Delta MERGE (upsert)from delta.tables import DeltaTable
target = DeltaTable.forPath(spark, "s3://bucket/delta/sales/")target.alias("t").merge( updates_df.alias("u"), "t.order_id = u.order_id").whenMatchedUpdateAll().whenNotMatchedInsertAll().execute()Q11. What is Delta Lake and why is it preferred over raw Parquet for production pipelines?
Delta Lake is an open-source storage layer that adds ACID transactions, schema enforcement, and time travel to data lake storage:
| Feature | Raw Parquet | Delta Lake |
|---|---|---|
| ACID transactions | No | Yes |
| Schema enforcement | Manual | Enforced on write |
| Time travel | No | Yes (version history) |
| Upsert (MERGE) | Painful rewrite | Native |
| Streaming + batch | Separate | Unified |
| Compaction | Manual | OPTIMIZE command |
# Time travel β read data as it was at a point in timedf_yesterday = spark.read.format("delta").option("timestampAsOf", "2025-06-15").load(path)df_version5 = spark.read.format("delta").option("versionAsOf", "5").load(path)
# Audit historyfrom delta.tables import DeltaTableDeltaTable.forPath(spark, path).history(10).show()Delta Lake (from Databricks, now open-source and part of The Linux Foundation) is now a standard for production data lakehouses alongside Apache Iceberg.