What is an RDD in Apache Spark?
An RDD (Resilient Distributed Dataset) is Apache Spark’s foundational data abstraction — an immutable, distributed collection of objects that can be processed in parallel across a cluster. Every higher-level Spark API (DataFrame, Dataset) is built on top of RDDs.
The Three RDD Properties
1. Resilient — fault-tolerant through lineage. If a partition is lost (due to node failure), Spark can recompute it from the lineage graph without requiring data replication.
2. Distributed — partitioned across multiple nodes. Each partition is processed independently by one task on one executor core.
3. Dataset — a collection of serializable objects (strings, integers, tuples, case classes, or any Java/Python objects).
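The first two properties can be illustrated together with a toy model in plain Python. This is only a sketch, not Spark's implementation: `ToyRDD`, `get`, `lose_partition`, and `compute_log` are hypothetical names, and a dict stands in for a computed partition held by an executor. Each partition is derived from its own slice of the source by replaying a recorded chain of steps, so a lost partition can be rebuilt without a replica and without touching the other partitions.

```python
from typing import Callable, Dict, List

Partition = List[str]
Step = Callable[[Partition], Partition]


class ToyRDD:
    """Toy stand-in for an RDD: source partitions plus a lineage of steps."""

    def __init__(self, source: List[Partition], lineage: List[Step]) -> None:
        self.source = source                        # durable input, one list per partition
        self.lineage = lineage                      # ordered per-partition steps
        self.in_memory: Dict[int, Partition] = {}   # stands in for executor memory
        self.compute_log: List[int] = []            # partition ids, in the order computed

    def _compute(self, idx: int) -> Partition:
        # Replay every lineage step against the source data of this partition only.
        data = self.source[idx]
        for step in self.lineage:
            data = step(data)
        self.compute_log.append(idx)
        return data

    def get(self, idx: int) -> Partition:
        # Compute on first access or after a loss; otherwise reuse the held copy.
        if idx not in self.in_memory:
            self.in_memory[idx] = self._compute(idx)
        return self.in_memory[idx]

    def lose_partition(self, idx: int) -> None:
        # Simulate an executor failure that drops one computed partition.
        self.in_memory.pop(idx, None)


source = [
    ["INFO|api|started", "ERROR|db|disk full"],
    ["ERROR|web|timeout", "INFO|api|stopped"],
]
lineage = [
    lambda part: [line for line in part if "ERROR" in line],  # plays the role of filter
    lambda part: [line.split("|")[2] for line in part],       # plays the role of map
]

msgs = ToyRDD(source, lineage)
first_run = [msgs.get(i) for i in range(len(source))]
msgs.lose_partition(1)       # partition 1 disappears
recovered = msgs.get(1)      # rebuilt from source + lineage

print(first_run)         # [['disk full'], ['timeout']]
print(recovered)         # ['timeout']
print(msgs.compute_log)  # [0, 1, 1]
```

The log shows partition 1 computed twice and partition 0 only once, which is the point of lineage-based recovery: the cost of a failure is limited to the partitions that were actually lost.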
RDD Characteristics
| Property | Description |
|---|---|
| Immutable | Cannot be modified after creation; transformations produce new RDDs |
| Lazy | Transformations are not executed until an action triggers computation |
| Partitioned | Split into chunks distributed across the cluster |
| Typed | Statically typed in Scala/Java; dynamically typed in Python |
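| In-memory | Not materialized by default; once persisted with `cache()` (storage level `MEMORY_ONLY`), partitions are kept in RAM and any that do not fit are recomputed when needed. `MEMORY_AND_DISK` spills them to disk instead |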
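The "Immutable" and "Lazy" rows can be made concrete with a small plain-Python model. This is a sketch under stated assumptions rather than Spark code: `LazyDataset` and the `calls` counter are hypothetical, and everything runs in a single process. Transformations only record a step and return a new object; the recorded steps run when the action `collect()` is called.

```python
from typing import Any, Callable, Iterable, List, Tuple

Step = Tuple[str, Callable[[Any], Any]]


class LazyDataset:
    """Toy model of lazy, immutable transformations (not Spark's implementation)."""

    def __init__(self, data: Iterable[Any], steps: Tuple[Step, ...] = ()) -> None:
        self._data = tuple(data)   # source is never modified
        self._steps = steps        # recorded transformations, oldest first

    def map(self, fn: Callable[[Any], Any]) -> "LazyDataset":
        # Transformation: record the step and return a NEW dataset.
        return LazyDataset(self._data, self._steps + (("map", fn),))

    def filter(self, pred: Callable[[Any], bool]) -> "LazyDataset":
        # Transformation: record the step and return a NEW dataset.
        return LazyDataset(self._data, self._steps + (("filter", pred),))

    def collect(self) -> List[Any]:
        # Action: only now are the recorded steps executed.
        out = list(self._data)
        for kind, fn in self._steps:
            if kind == "map":
                out = [fn(x) for x in out]
            else:
                out = [x for x in out if fn(x)]
        return out


calls: List[int] = []


def square(x: int) -> int:
    calls.append(x)   # side effect used to observe when the function runs
    return x * x


base = LazyDataset([1, 2, 3, 4, 5])
squared = base.map(square)
evens = squared.filter(lambda x: x % 2 == 0)

calls_before_action = len(calls)   # 0: no transformation has run yet
result = evens.collect()           # the action triggers the whole chain

print(calls_before_action)  # 0
print(result)               # [4, 16]
print(base.collect())       # [1, 2, 3, 4, 5], the original is unchanged
```

Because each transformation returns a new object, `base` can be reused as the starting point for other chains, which mirrors how one RDD can feed several downstream RDDs.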
Creating RDDs
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("RDD Basics").getOrCreate()
sc = spark.sparkContext

# From a Python collection
rdd1 = sc.parallelize([1, 2, 3, 4, 5])
rdd2 = sc.parallelize(range(1000), numSlices=8)  # 8 partitions

# From a file
rdd3 = sc.textFile("s3://bucket/logs/*.log")
rdd4 = sc.textFile("hdfs://namenode/data.txt", minPartitions=4)

# From another RDD (transformation)
rdd5 = rdd1.filter(lambda x: x % 2 == 0)
rdd6 = rdd3.map(lambda line: line.upper())

How Fault Tolerance Works
# Step 1: Create a chain of transformations
raw = sc.textFile("s3://bucket/logs.txt")
filt = raw.filter(lambda l: "ERROR" in l)
msgs = filt.map(lambda l: l.split("|")[2])  # assumes every ERROR line has at least three "|"-separated fields

# Step 2: If an executor fails mid-job and loses a partition of "msgs":
# - Spark does NOT need a backup copy of the data
# - Spark replays: textFile → filter → map for that partition only
# - Job continues without data loss

msgs.count()  # Completes even if a node fails, as long as the source file is still readable

RDD Operations
rdd = sc.parallelize(["hello world", "apache spark", "big data"])

# Transformations (lazy)
words = rdd.flatMap(lambda s: s.split())      # ["hello", "world", ...]
lengths = words.map(lambda w: (w, len(w)))    # [("hello", 5), ("world", 5), ...]
long_w = words.filter(lambda w: len(w) > 4)   # ["hello", "world", "apache", "spark"]

# Actions (trigger execution)
long_w.collect()  # ["hello", "world", "apache", "spark"]
long_w.count()    # 4
long_w.take(2)    # ["hello", "world"]

RDD vs DataFrame: When to Use Each
In 2025, use DataFrames for the vast majority of work:
| Use RDDs When… | Use DataFrames When… |
|---|---|
| Processing unstructured data (raw text, binary) | Structured/semi-structured data |
| Complex custom logic that doesn’t map to DataFrame API | SQL-like transformations |
| Low-level control over partitioning | Automatic Catalyst optimization |
| Legacy Spark 1.x code | New code (prefer DataFrames by default) |
| Per-partition setup of resources that cannot be serialized (for example, a client opened inside `mapPartitions`) | Standard column operations |
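DataFrames are typically faster (Catalyst query optimization plus Tungsten execution), safer (schema enforcement), and more concise. The gap tends to be largest in PySpark, where RDD lambdas run in Python worker processes while built-in DataFrame column operations execute inside the JVM.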