Technology  /  Apache Spark

Apache Spark 49 guides · updated 2026

Distributed data processing at scale — RDDs, DataFrames, Structured Streaming, and the tuning techniques that keep Spark jobs fast and cheap.

Apache Spark Architecture

Spark follows a master/worker architecture with a central driver coordinating distributed executors. Understanding this architecture is essential for tuning performance and diagnosing failures.


High-Level Architecture

User Code (PySpark / Scala)
DRIVER PROGRAM
├── SparkContext / SparkSession
├── DAGScheduler — builds stage DAG from transformation graph
└── TaskScheduler — assigns tasks to executor cores
↓ (via Cluster Manager)
CLUSTER MANAGER (YARN / Kubernetes / Standalone)
├── Negotiates resources
└── Launches executor processes on Worker Nodes
WORKER NODE 1 WORKER NODE 2 WORKER NODE N
└── Executor └── Executor └── Executor
├── Task 1 ├── Task 5 ├── Task 9
├── Task 2 ├── Task 6 ├── Task 10
├── JVM memory ├── JVM memory ├── JVM memory
└── Block Manager └── Block Manager └── Block Manager

The Driver

The driver is the JVM process running the main() function (or the PySpark script). It:

from pyspark.sql import SparkSession
# This process IS the driver
spark = SparkSession.builder.appName("MyApp").getOrCreate()
sc = spark.sparkContext
# All planning happens here; data doesn't pass through the driver (except collect())
df = spark.read.parquet("s3://bucket/data/")
result = df.groupBy("region").sum("revenue")
# Only this line brings data to the driver — keep result small
result.show(20)

Executors

Executors are long-running JVM processes on worker nodes. Each executor:

# Tune executor resources at session creation
spark = SparkSession.builder \
.config("spark.executor.instances", "10") # 10 executor JVMs
.config("spark.executor.cores", "4") # 4 cores per executor
.config("spark.executor.memory", "8g") # 8 GB JVM heap per executor
.config("spark.executor.memoryOverhead", "2g") # Off-heap (shuffle, native)
.getOrCreate()

Cluster Managers

Spark delegates resource allocation to a cluster manager:

Cluster ManagerUse Case
StandaloneSimple Spark-only clusters; dev/test
YARNShared Hadoop clusters; Hadoop ecosystem integration
KubernetesCloud-native deployments; container isolation per job
MesosLegacy; largely replaced by Kubernetes
LocalDevelopment (local[*] uses all CPU cores)
# Run locally on 4 cores
spark = SparkSession.builder.master("local[4]").getOrCreate()
# Submit to YARN
# spark-submit --master yarn --deploy-mode cluster my_app.py
# Submit to Kubernetes
# spark-submit --master k8s://https://k8s-cluster:6443 my_app.py

Scheduler Components

Action called
RDD Graph / DataFrame plan
DAGScheduler
- Builds DAG
- Identifies shuffle boundaries → creates stages
- Submits stages with resolved dependencies
TaskScheduler
- Takes stages from DAGScheduler
- Creates one Task per partition
- Assigns tasks to available executor slots
- Handles task failures and retries
SchedulerBackend
- Communicates with Cluster Manager to acquire/release resources
- Launches executor processes

Memory Layout per Executor

Executor JVM Heap (e.g., 8 GB)
├── Reserved Memory: 300 MB (system use, not configurable)
├── User Memory (40%): 3.1 GB
│ └── UDFs, data structures defined in user code
└── Unified Memory (60%): 4.6 GB
├── Execution Memory (dynamic): shuffle, sort, aggregation
└── Storage Memory (dynamic): cached RDDs/DataFrames

Execution and storage memory share the unified pool dynamically via spark.memory.fraction (default: 0.6) and spark.memory.storageFraction (default: 0.5).