Apache Spark Architecture

Spark follows a master/worker architecture with a central driver coordinating distributed executors. Understanding this architecture is essential for tuning performance and diagnosing failures.

High-Level Architecture

User Code (PySpark / Scala)
        ↓
    DRIVER PROGRAM
    ├── SparkContext / SparkSession
    ├── DAGScheduler    — builds stage DAG from transformation graph
    └── TaskScheduler   — assigns tasks to executor cores

        ↓ (via Cluster Manager)

CLUSTER MANAGER (YARN / Kubernetes / Standalone)
    ├── Negotiates resources
    └── Launches executor processes on Worker Nodes

WORKER NODE 1          WORKER NODE 2          WORKER NODE N
└── Executor           └── Executor           └── Executor
    ├── Task 1             ├── Task 5             ├── Task 9
    ├── Task 2             ├── Task 6             ├── Task 10
    ├── JVM memory         ├── JVM memory         ├── JVM memory
    └── Block Manager      └── Block Manager      └── Block Manager

The Driver

The driver is the JVM process running the main() function (or the PySpark script). It:

Creates the SparkSession / SparkContext
Analyzes user code to build the execution DAG
Breaks the DAG into stages and tasks
Schedules tasks on executors
Collects results from executors for collect() / count() actions

from pyspark.sql import SparkSession

# This process IS the driver
spark = SparkSession.builder.appName("MyApp").getOrCreate()
sc = spark.sparkContext

# All planning happens here; data doesn't pass through the driver (except collect())
df = spark.read.parquet("s3://bucket/data/")
result = df.groupBy("region").sum("revenue")

# Only this line brings data to the driver — keep result small
result.show(20)

Executors

Executors are long-running JVM processes on worker nodes. Each executor:

Receives serialized tasks from the driver
Runs tasks in parallel across its CPU cores
Stores cached RDD/DataFrame partitions (Block Manager)
Returns task results and shuffle data

# Tune executor resources at session creation
spark = SparkSession.builder \
    .config("spark.executor.instances", "10")    # 10 executor JVMs
    .config("spark.executor.cores",     "4")     # 4 cores per executor
    .config("spark.executor.memory",    "8g")    # 8 GB JVM heap per executor
    .config("spark.executor.memoryOverhead", "2g") # Off-heap (shuffle, native)
    .getOrCreate()

Cluster Managers

Spark delegates resource allocation to a cluster manager:

Cluster Manager	Use Case
Standalone	Simple Spark-only clusters; dev/test
YARN	Shared Hadoop clusters; Hadoop ecosystem integration
Kubernetes	Cloud-native deployments; container isolation per job
Mesos	Legacy; largely replaced by Kubernetes
Local	Development (`local[*]` uses all CPU cores)

# Run locally on 4 cores
spark = SparkSession.builder.master("local[4]").getOrCreate()

# Submit to YARN
# spark-submit --master yarn --deploy-mode cluster my_app.py

# Submit to Kubernetes
# spark-submit --master k8s://https://k8s-cluster:6443 my_app.py

Scheduler Components

Action called
    ↓
RDD Graph / DataFrame plan
    ↓
DAGScheduler
  - Builds DAG
  - Identifies shuffle boundaries → creates stages
  - Submits stages with resolved dependencies
    ↓
TaskScheduler
  - Takes stages from DAGScheduler
  - Creates one Task per partition
  - Assigns tasks to available executor slots
  - Handles task failures and retries
    ↓
SchedulerBackend
  - Communicates with Cluster Manager to acquire/release resources
  - Launches executor processes

Memory Layout per Executor

Executor JVM Heap (e.g., 8 GB)
├── Reserved Memory: 300 MB (system use, not configurable)
├── User Memory (40%): 3.1 GB
│   └── UDFs, data structures defined in user code
└── Unified Memory (60%): 4.6 GB
    ├── Execution Memory (dynamic): shuffle, sort, aggregation
    └── Storage Memory (dynamic): cached RDDs/DataFrames

Execution and storage memory share the unified pool dynamically via spark.memory.fraction (default: 0.6) and spark.memory.storageFraction (default: 0.5).