Spark Lineage Graph (DAG)

Every RDD in Spark remembers exactly how it was built — the parent RDDs and the transformation applied. This chain of dependencies forms the lineage graph, also called the Directed Acyclic Graph (DAG). The lineage is Spark’s foundation for fault tolerance: if a partition is lost, Spark recomputes it from the lineage instead of replicating data across the cluster.

How Lineage Works

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Lineage Demo").getOrCreate()
sc = spark.sparkContext

# Step 1: Root RDD — reads from storage
raw_logs = sc.textFile("s3://bucket/weblogs/*.log")

# Step 2-5: Each transformation adds a link in the lineage chain
parsed   = raw_logs.map(lambda line: line.split("\t"))
errors   = parsed.filter(lambda parts: parts[2] == "ERROR")
messages = errors.map(lambda parts: parts[3])
counts   = messages.map(lambda m: (m, 1)).reduceByKey(lambda a, b: a + b)

# Lineage: counts ← reduceByKey ← messages ← map ← errors ← filter ← raw_logs ← textFile

Inspecting Lineage with toDebugString

print(counts.toDebugString().decode("utf-8"))

Output:

(8) PairwiseRDD[5] at reduceByKey at <stdin>:1 []
 |  MapPartitionsRDD[4] at map at <stdin>:1 []
 |  MapPartitionsRDD[3] at map at <stdin>:1 []
 |  FilteredRDD[2] at filter at <stdin>:1 []
 |  MapPartitionsRDD[1] at map at <stdin>:1 []
 |  s3://bucket/weblogs/*.log MapPartitionsRDD[0] at textFile at <stdin>:1 []

The (8) means 8 partitions. Each indented line is one transformation step.

Narrow vs Wide Dependencies

Lineage dependency type determines how partitions map between parent and child:

NARROW DEPENDENCY             WIDE DEPENDENCY (shuffle boundary)
Partition A → Partition A     Partition A → Partition A, B, C
Partition B → Partition B     Partition B → Partition A, B, C
Partition C → Partition C     Partition C → Partition A, B, C

(map, filter, flatMap)        (reduceByKey, join, groupByKey)

Wide dependencies create stage boundaries — Spark must complete the parent stage before the child can begin.

Fault Tolerance via Lineage

# Scenario: Executor 3 dies mid-job, losing its partition of "errors"
# Spark's response:
#   1. Detect missing partition
#   2. Follow lineage: errors ← filter ← parsed ← map ← raw_logs
#   3. Re-read only the affected S3 file chunk
#   4. Re-run map → filter for that partition only
#   5. Continue the job — no data loss

counts.count()   # Completes even after executor failure

DataFrame Lineage with explain()

from pyspark.sql import functions as F

df = spark.read.parquet("transactions.parquet") \
    .filter(F.col("amount") > 100) \
    .groupBy("category") \
    .agg(F.sum("amount").alias("total"))

df.explain()                   # Physical plan
df.explain(mode="extended")    # Logical + optimized + physical plans
df.explain(mode="formatted")   # Human-readable tree

When Lineage Becomes a Problem

Long iterative algorithms build lineage over many iterations, causing stack overflows in very deep DAGs.

Solution: checkpoint to truncate the lineage:

sc.setCheckpointDir("hdfs://cluster/checkpoints/")

for iteration in range(50):
    rdd = rdd.map(transform).filter(predicate)
    if iteration % 5 == 0:
        rdd.checkpoint()
        rdd.count()   # Materialize before truncating

After checkpoint(), the lineage restarts from the checkpoint file.

DAG Visualization in Spark UI

While a job runs, open http://localhost:4040 → Jobs → click a job → DAG Visualization:

Blue boxes represent RDD/partition stages
Stage boundaries appear at shuffle points
Skipped stages (from cache) shown separately
Task timelines show per-task execution details