Technology  /  Apache Spark

Apache Spark 49 guides · updated 2026

Distributed data processing at scale — RDDs, DataFrames, Structured Streaming, and the tuning techniques that keep Spark jobs fast and cheap.

Spark Lineage Graph (DAG)

Every RDD in Spark remembers exactly how it was built — the parent RDDs and the transformation applied. This chain of dependencies forms the lineage graph, also called the Directed Acyclic Graph (DAG). The lineage is Spark’s foundation for fault tolerance: if a partition is lost, Spark recomputes it from the lineage instead of replicating data across the cluster.


How Lineage Works

from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("Lineage Demo").getOrCreate()
sc = spark.sparkContext
# Step 1: Root RDD — reads from storage
raw_logs = sc.textFile("s3://bucket/weblogs/*.log")
# Step 2-5: Each transformation adds a link in the lineage chain
parsed = raw_logs.map(lambda line: line.split("\t"))
errors = parsed.filter(lambda parts: parts[2] == "ERROR")
messages = errors.map(lambda parts: parts[3])
counts = messages.map(lambda m: (m, 1)).reduceByKey(lambda a, b: a + b)
# Lineage: counts ← reduceByKey ← messages ← map ← errors ← filter ← raw_logs ← textFile

Inspecting Lineage with toDebugString

print(counts.toDebugString().decode("utf-8"))

Output:

(8) PairwiseRDD[5] at reduceByKey at <stdin>:1 []
| MapPartitionsRDD[4] at map at <stdin>:1 []
| MapPartitionsRDD[3] at map at <stdin>:1 []
| FilteredRDD[2] at filter at <stdin>:1 []
| MapPartitionsRDD[1] at map at <stdin>:1 []
| s3://bucket/weblogs/*.log MapPartitionsRDD[0] at textFile at <stdin>:1 []

The (8) means 8 partitions. Each indented line is one transformation step.


Narrow vs Wide Dependencies

Lineage dependency type determines how partitions map between parent and child:

NARROW DEPENDENCY WIDE DEPENDENCY (shuffle boundary)
Partition A → Partition A Partition A → Partition A, B, C
Partition B → Partition B Partition B → Partition A, B, C
Partition C → Partition C Partition C → Partition A, B, C
(map, filter, flatMap) (reduceByKey, join, groupByKey)

Wide dependencies create stage boundaries — Spark must complete the parent stage before the child can begin.


Fault Tolerance via Lineage

# Scenario: Executor 3 dies mid-job, losing its partition of "errors"
# Spark's response:
# 1. Detect missing partition
# 2. Follow lineage: errors ← filter ← parsed ← map ← raw_logs
# 3. Re-read only the affected S3 file chunk
# 4. Re-run map → filter for that partition only
# 5. Continue the job — no data loss
counts.count() # Completes even after executor failure

DataFrame Lineage with explain()

from pyspark.sql import functions as F
df = spark.read.parquet("transactions.parquet") \
.filter(F.col("amount") > 100) \
.groupBy("category") \
.agg(F.sum("amount").alias("total"))
df.explain() # Physical plan
df.explain(mode="extended") # Logical + optimized + physical plans
df.explain(mode="formatted") # Human-readable tree

When Lineage Becomes a Problem

Long iterative algorithms build lineage over many iterations, causing stack overflows in very deep DAGs.

Solution: checkpoint to truncate the lineage:

sc.setCheckpointDir("hdfs://cluster/checkpoints/")
for iteration in range(50):
rdd = rdd.map(transform).filter(predicate)
if iteration % 5 == 0:
rdd.checkpoint()
rdd.count() # Materialize before truncating

After checkpoint(), the lineage restarts from the checkpoint file.


DAG Visualization in Spark UI

While a job runs, open http://localhost:4040Jobs → click a job → DAG Visualization: