Creating an RDD from a Collection
sc.parallelize() is the simplest way to create an RDD — it distributes a Python collection (list, range, or any iterable) across the cluster’s partitions. This is the starting point for learning Spark transformations without needing real data files.
Basic parallelize()
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("RDD Collection").getOrCreate()sc = spark.sparkContext
# From a list of integersrdd1 = sc.parallelize([1, 2, 3, 4, 5])print(rdd1.collect()) # [1, 2, 3, 4, 5]
# From a rangerdd2 = sc.parallelize(range(1000))print(rdd2.count()) # 1000
# From a list of stringsrdd3 = sc.parallelize(["apple", "banana", "cherry"])rdd3.foreach(print)
# From a list of tuples (key-value pairs)pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3), ("c", 1)])pairs.reduceByKey(lambda x, y: x + y).collect()# [("a", 4), ("b", 2), ("c", 1)]Controlling the Number of Partitions
The numSlices parameter sets how many partitions the collection is split into:
# Default: sc.defaultParallelism (usually 2 × CPU cores)rdd_default = sc.parallelize(range(100))print(rdd_default.getNumPartitions()) # e.g., 8 on 4 cores
# Explicit partition countrdd_4 = sc.parallelize(range(100), numSlices=4)print(rdd_4.getNumPartitions()) # 4
rdd_1 = sc.parallelize(range(100), numSlices=1)print(rdd_1.getNumPartitions()) # 1
# Too few partitions → underutilizes the cluster# Too many partitions → scheduling overhead dominates (for small data)# Rule of thumb for small in-memory data: 2-4 × number of coresWhat Each Partition Gets
# See which elements are in which partitionrdd = sc.parallelize(range(10), numSlices=4)rdd.mapPartitionsWithIndex( lambda i, it: [(i, list(it))]).collect()# [(0, [0, 1, 2]), (1, [3, 4, 5]), (2, [6, 7]), (3, [8, 9])]Various Collection Types
# Nested listsnested = sc.parallelize([[1, 2, 3], [4, 5], [6, 7, 8, 9]])nested.map(sum).collect() # [6, 9, 30]
# Dictionariesdicts = sc.parallelize([ {"name": "Alice", "score": 92}, {"name": "Bob", "score": 78},])dicts.map(lambda d: (d["name"], d["score"])).collect()
# Mixed types (possible in Python, but avoid it)mixed = sc.parallelize([1, "two", 3.0])mixed.collect() # [1, "two", 3.0]parallelize() vs Realistic Data Loading
parallelize() is ideal for testing but has real-world limitations:
# FINE for:# - Unit tests# - Learning/experimenting# - Small lookup tables to broadcastsmall_lookup = sc.parallelize(list(range(100)))
# Use file-based reading for production:# - Data that doesn't fit in driver memory can't be parallelized from a Python list# - sc.textFile() / spark.read.parquet() streams data directly from storageproduction_rdd = sc.textFile("s3://bucket/large-data/*.txt")Converting Parallelized RDD to DataFrame
from pyspark.sql import Row
data = [ Row(name="Alice", salary=95000, dept="Engineering"), Row(name="Bob", salary=72000, dept="Marketing"),]rdd = sc.parallelize(data)df = spark.createDataFrame(rdd)df.show()