Technology  /  Apache Spark

Apache Spark 49 guides · updated 2026

Distributed data processing at scale — RDDs, DataFrames, Structured Streaming, and the tuning techniques that keep Spark jobs fast and cheap.

Spark Transformations

A transformation in Spark produces a new RDD or DataFrame from an existing one without immediately running any computation. Spark records the transformation in a lineage graph (DAG) and defers execution until an action is called. This lazy approach lets Spark optimize the entire chain of operations before a single byte of data moves.


Narrow vs Wide Transformations

This is one of Spark’s most important architectural concepts:

PropertyNarrowWide
Data movementEach partition maps to one output partitionMultiple input partitions contribute to one output partition
Network shuffleNoneYes — data crosses the network
Stage boundaryNoYes — creates a new stage
Examplesmap, filter, flatMap, mapPartitionsgroupByKey, reduceByKey, join, distinct, repartition

Minimize wide transformations (shuffles) for better performance.


Element-Level Transformations (Narrow)

from pyspark.sql import SparkSession, functions as F
spark = SparkSession.builder.appName("Transformations").getOrCreate()
sc = spark.sparkContext
# map — one element in, one element out
numbers = sc.parallelize([1, 2, 3, 4, 5])
squared = numbers.map(lambda x: x ** 2) # [1, 4, 9, 16, 25]
# filter — keep elements matching predicate
evens = numbers.filter(lambda x: x % 2 == 0) # [2, 4]
# flatMap — one element in, zero or more out
sentences = sc.parallelize(["hello world", "apache spark is fast"])
words = sentences.flatMap(lambda s: s.split())
# ["hello", "world", "apache", "spark", "is", "fast"]
# mapPartitions — process an entire partition at once (more efficient than map for setup costs)
def process_partition(records):
# Open DB connection once per partition, not per record
for record in records:
yield record.upper()
result = sc.parallelize(["a", "b", "c"]).mapPartitions(process_partition)

DataFrame Transformations

data = [("Alice", "Eng", 95000), ("Bob", "Mkt", 72000), ("Carol", "Eng", 110000)]
df = spark.createDataFrame(data, ["name", "dept", "salary"])
# select — column projection
df.select("name", "salary")
df.select(F.col("name"), (F.col("salary") * 1.1).alias("bumped_salary"))
# filter / where
df.filter(F.col("salary") > 80000)
df.where("dept = 'Eng'")
df.filter((F.col("dept") == "Eng") & (F.col("salary") > 90000))
# withColumn — add or replace a column
df.withColumn("is_senior", F.col("salary") > 100000)
df.withColumn("dept_upper", F.upper(F.col("dept")))
# drop — remove columns
df.drop("dept")
# distinct — remove duplicate rows
df.distinct()
df.dropDuplicates(["dept"]) # Deduplicate on specific columns
# orderBy / sort
df.orderBy(F.col("salary").desc())
df.sort("dept", F.col("salary").desc())
# limit
df.limit(100)

Aggregation Transformations (Wide)

# groupBy + aggregate
df.groupBy("dept").agg(
F.count("*").alias("headcount"),
F.avg("salary").alias("avg_salary"),
F.max("salary").alias("max_salary"),
)
# pivot
quarterly = spark.createDataFrame([
("Q1", "Electronics", 50000), ("Q2", "Electronics", 60000),
("Q1", "Clothing", 30000), ("Q2", "Clothing", 35000),
], ["quarter", "category", "revenue"])
quarterly.groupBy("category").pivot("quarter", ["Q1", "Q2"]).sum("revenue").show()
# +----------+-----+-----+
# |category |Q1 |Q2 |
# +----------+-----+-----+
# |Electronics|50000|60000|
# |Clothing |30000|35000|

Key-Value RDD Transformations

pairs = sc.parallelize([("apple", 3), ("banana", 5), ("apple", 2), ("cherry", 1)])
# reduceByKey — merges per key locally first (efficient)
pairs.reduceByKey(lambda a, b: a + b)
# [("apple", 5), ("banana", 5), ("cherry", 1)]
# groupByKey — collects all values per key (can cause OOM for large datasets)
pairs.groupByKey().mapValues(list)
# [("apple", [3, 2]), ("banana", [5]), ("cherry", [1])]
# aggregateByKey — custom combine logic within and between partitions
pairs.aggregateByKey(
zeroValue=0,
seqFunc=lambda acc, v: acc + v, # Within partition
combFunc=lambda a, b: a + b # Between partitions
)
# sortByKey
pairs.sortByKey(ascending=True)
# keys / values
pairs.keys() # ["apple", "banana", "apple", "cherry"]
pairs.values() # [3, 5, 2, 1]

Transformation Chaining

Spark is optimized for long chains — Catalyst merges adjacent narrow transformations into a single pass:

# All narrow transformations fuse into one physical stage
result = (
df
.filter(F.col("dept") == "Eng") # Narrow
.withColumn("tax", F.col("salary") * 0.3) # Narrow
.select("name", "salary", "tax") # Narrow
.orderBy("salary") # Wide (sort requires shuffle)
.limit(5)
)
result.show()