Spark Lineage Graph (DAG)
Every RDD in Spark remembers exactly how it was built — the parent RDDs and the transformation applied. This chain of dependencies forms the lineage graph, also called the Directed Acyclic Graph (DAG). The lineage is Spark’s foundation for fault tolerance: if a partition is lost, Spark recomputes it from the lineage instead of replicating data across the cluster.
How Lineage Works
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("Lineage Demo").getOrCreate()sc = spark.sparkContext
# Step 1: Root RDD — reads from storageraw_logs = sc.textFile("s3://bucket/weblogs/*.log")
# Step 2-5: Each transformation adds a link in the lineage chainparsed = raw_logs.map(lambda line: line.split("\t"))errors = parsed.filter(lambda parts: parts[2] == "ERROR")messages = errors.map(lambda parts: parts[3])counts = messages.map(lambda m: (m, 1)).reduceByKey(lambda a, b: a + b)
# Lineage: counts ← reduceByKey ← messages ← map ← errors ← filter ← raw_logs ← textFileInspecting Lineage with toDebugString
print(counts.toDebugString().decode("utf-8"))Output:
(8) PairwiseRDD[5] at reduceByKey at <stdin>:1 [] | MapPartitionsRDD[4] at map at <stdin>:1 [] | MapPartitionsRDD[3] at map at <stdin>:1 [] | FilteredRDD[2] at filter at <stdin>:1 [] | MapPartitionsRDD[1] at map at <stdin>:1 [] | s3://bucket/weblogs/*.log MapPartitionsRDD[0] at textFile at <stdin>:1 []The (8) means 8 partitions. Each indented line is one transformation step.
Narrow vs Wide Dependencies
Lineage dependency type determines how partitions map between parent and child:
NARROW DEPENDENCY WIDE DEPENDENCY (shuffle boundary)Partition A → Partition A Partition A → Partition A, B, CPartition B → Partition B Partition B → Partition A, B, CPartition C → Partition C Partition C → Partition A, B, C
(map, filter, flatMap) (reduceByKey, join, groupByKey)Wide dependencies create stage boundaries — Spark must complete the parent stage before the child can begin.
Fault Tolerance via Lineage
# Scenario: Executor 3 dies mid-job, losing its partition of "errors"# Spark's response:# 1. Detect missing partition# 2. Follow lineage: errors ← filter ← parsed ← map ← raw_logs# 3. Re-read only the affected S3 file chunk# 4. Re-run map → filter for that partition only# 5. Continue the job — no data loss
counts.count() # Completes even after executor failureDataFrame Lineage with explain()
from pyspark.sql import functions as F
df = spark.read.parquet("transactions.parquet") \ .filter(F.col("amount") > 100) \ .groupBy("category") \ .agg(F.sum("amount").alias("total"))
df.explain() # Physical plandf.explain(mode="extended") # Logical + optimized + physical plansdf.explain(mode="formatted") # Human-readable treeWhen Lineage Becomes a Problem
Long iterative algorithms build lineage over many iterations, causing stack overflows in very deep DAGs.
Solution: checkpoint to truncate the lineage:
sc.setCheckpointDir("hdfs://cluster/checkpoints/")
for iteration in range(50): rdd = rdd.map(transform).filter(predicate) if iteration % 5 == 0: rdd.checkpoint() rdd.count() # Materialize before truncatingAfter checkpoint(), the lineage restarts from the checkpoint file.
DAG Visualization in Spark UI
While a job runs, open http://localhost:4040 → Jobs → click a job → DAG Visualization:
- Blue boxes represent RDD/partition stages
- Stage boundaries appear at shuffle points
- Skipped stages (from cache) shown separately
- Task timelines show per-task execution details