Apache Spark Architecture
Spark follows a master/worker architecture with a central driver coordinating distributed executors. Understanding this architecture is essential for tuning performance and diagnosing failures.
High-Level Architecture
User Code (PySpark / Scala)
        ↓
DRIVER PROGRAM
├── SparkContext / SparkSession
├── DAGScheduler: builds stage DAG from transformation graph
└── TaskScheduler: assigns tasks to executor cores
        ↓ (via Cluster Manager)
CLUSTER MANAGER (YARN / Kubernetes / Standalone)
├── Negotiates resources
└── Launches executor processes on Worker Nodes

WORKER NODE 1          WORKER NODE 2          WORKER NODE N
└── Executor           └── Executor           └── Executor
    ├── Task 1             ├── Task 5             ├── Task 9
    ├── Task 2             ├── Task 6             ├── Task 10
    ├── JVM memory         ├── JVM memory         ├── JVM memory
    └── Block Manager      └── Block Manager      └── Block Manager

The Driver
The driver is the process that runs the application's main() function (for PySpark, the Python script together with a companion JVM). It:
- Creates the SparkSession/SparkContext
- Analyzes user code to build the execution DAG
- Breaks the DAG into stages and tasks
- Schedules tasks on executors
- Collects results from executors for collect()/count() actions
from pyspark.sql import SparkSession

# This process IS the driver
spark = SparkSession.builder.appName("MyApp").getOrCreate()
sc = spark.sparkContext

# All planning happens here; data doesn't pass through the driver (except collect())
df = spark.read.parquet("s3://bucket/data/")
result = df.groupBy("region").sum("revenue")

# Only this line brings data to the driver, so keep the result small
result.show(20)

Executors
Executors are long-running JVM processes on worker nodes. Each executor:
- Receives serialized tasks from the driver
- Runs tasks in parallel across its CPU cores
- Stores cached RDD/DataFrame partitions (Block Manager)
- Returns task results and shuffle data
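Because an executor runs at most one task per core at a time (with the default spark.task.cpus of 1), the executor count, cores, and memory settings together determine how many tasks can run concurrently and how much memory each executor container requests. The pure-Python sketch below works through that arithmetic. The helper name `executor_footprint` and the example values are illustrative assumptions, not part of any Spark API.

```python
def executor_footprint(instances, cores, heap_gb, overhead_gb, task_cpus=1):
    """Estimate parallelism and requested memory for a set of executors.

    The parameters are hypothetical inputs that mirror spark.executor.instances,
    spark.executor.cores, spark.executor.memory, spark.executor.memoryOverhead,
    and spark.task.cpus.
    """
    slots_per_executor = cores // task_cpus      # concurrent tasks per executor
    container_gb = heap_gb + overhead_gb         # heap plus non-heap overhead
    return {
        "task_slots": instances * slots_per_executor,
        "container_gb": container_gb,
        "cluster_gb": instances * container_gb,
    }


footprint = executor_footprint(instances=10, cores=4, heap_gb=8, overhead_gb=2)
print(footprint)
# {'task_slots': 40, 'container_gb': 10, 'cluster_gb': 100}
```

With 40 slots, a stage of 200 partitions would run in roughly five waves, assuming the tasks take similar time.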
# Tune executor resources at session creation
# (parentheses allow inline comments; backslash continuations would not)
spark = (
    SparkSession.builder
    .config("spark.executor.instances", "10")       # 10 executor JVMs
    .config("spark.executor.cores", "4")            # 4 cores per executor
    .config("spark.executor.memory", "8g")          # 8 GB JVM heap per executor
    .config("spark.executor.memoryOverhead", "2g")  # Non-heap overhead (JVM internals, native memory)
    .getOrCreate()
)

Cluster Managers
Spark delegates resource allocation to a cluster manager:
| Cluster Manager | Use Case |
|---|---|
| Standalone | Simple Spark-only clusters; dev/test |
| YARN | Shared Hadoop clusters; Hadoop ecosystem integration |
| Kubernetes | Cloud-native deployments; container isolation per job |
| Mesos | Legacy; deprecated since Spark 3.2 and largely replaced by Kubernetes |
| Local | Development (local[*] uses all CPU cores) |
# Run locally on 4 cores
spark = SparkSession.builder.master("local[4]").getOrCreate()

# Submit to YARN
# spark-submit --master yarn --deploy-mode cluster my_app.py

# Submit to Kubernetes
# spark-submit --master k8s://https://k8s-cluster:6443 my_app.py

Scheduler Components
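As a toy illustration of how shuffle boundaries turn one job into several stages, the pure-Python sketch below walks a linear chain of operations and starts a new stage at each wide (shuffle-producing) operation. It is a simplification under stated assumptions, not Spark's DAGScheduler: the `WIDE_OPS` set, the `split_into_stages` helper, and the linear-chain model are all hypothetical, and real plans are graphs rather than lists.

```python
# Hypothetical set of operations treated as wide (shuffle-producing) in this toy model.
WIDE_OPS = {"groupBy", "join", "repartition", "orderBy", "distinct"}


def split_into_stages(ops):
    """Split a linear chain of operation names into stages.

    A wide operation reads shuffled data, so it closes the upstream stage
    and becomes the first operation of the next one.
    """
    stages, current = [], []
    for op in ops:
        if op in WIDE_OPS and current:
            stages.append(current)  # shuffle boundary: finish the upstream stage
            current = []
        current.append(op)
    if current:
        stages.append(current)
    return stages


plan = ["read", "filter", "groupBy", "sum"]
stages = split_into_stages(plan)
print(stages)  # [['read', 'filter'], ['groupBy', 'sum']]

# One task per partition in each stage; these partition counts are assumed values.
partitions = [8, 200]
total_tasks = sum(partitions[:len(stages)])
print(total_tasks)  # 208
```

The narrow operations (`read`, `filter`) stay pipelined in one stage, while the aggregation lands in a second stage that can only start once the first stage's shuffle output is available.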
Action called
    ↓
RDD Graph / DataFrame plan
    ↓
DAGScheduler
  - Builds DAG
  - Identifies shuffle boundaries → creates stages
  - Submits each stage, once its dependencies are resolved, as a set of tasks (one Task per partition)
    ↓
TaskScheduler
  - Receives task sets from DAGScheduler
  - Assigns tasks to available executor slots
  - Handles task failures and retries
    ↓
SchedulerBackend
  - Communicates with Cluster Manager to acquire/release resources
  - Launches executor processes

Memory Layout per Executor
Executor JVM Heap (e.g., 8 GB)
├── Reserved Memory: 300 MB (system use, not configurable)
├── User Memory (40% of heap minus reserved): ~3.1 GB
│   └── UDFs, data structures defined in user code
└── Unified Memory (60% of heap minus reserved): ~4.6 GB
    ├── Execution Memory (dynamic): shuffle, sort, aggregation
    └── Storage Memory (dynamic): cached RDDs/DataFrames

spark.memory.fraction (default: 0.6) sets the size of the unified pool as a share of the heap minus reserved memory. Execution and storage borrow from each other within that pool, and spark.memory.storageFraction (default: 0.5) sets the portion of the pool in which cached blocks are protected from eviction by execution.
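The figures in the layout above can be reproduced with a few lines of arithmetic. The sketch below is a simplified model: it assumes the usable heap equals the configured executor memory (the JVM usually reports slightly less) and that the 300 MB reservation is subtracted before spark.memory.fraction is applied. `executor_memory_layout` is a hypothetical helper, not a Spark API.

```python
RESERVED_MB = 300  # fixed reservation, as described in the layout


def executor_memory_layout(heap_mb, memory_fraction=0.6, storage_fraction=0.5):
    """Return approximate region sizes in MB for one executor heap.

    memory_fraction and storage_fraction stand in for spark.memory.fraction
    and spark.memory.storageFraction.
    """
    usable = heap_mb - RESERVED_MB
    unified = usable * memory_fraction
    return {
        "reserved": RESERVED_MB,
        "user": usable - unified,
        "unified": unified,
        # Share of the unified pool where cached blocks cannot be evicted by execution.
        "storage_protected": unified * storage_fraction,
    }


layout = executor_memory_layout(8 * 1024)  # 8 GB heap
for region, mb in layout.items():
    print(f"{region:>18}: {mb:8.1f} MB ({mb / 1024:.2f} GB)")
```

For an 8 GB heap this gives about 4.6 GB of unified memory and 3.1 GB of user memory, matching the layout. Storage can still grow past its protected share while execution memory is idle.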