Spark Partitions

A partition is the smallest unit of data distribution in Spark — a slice of your dataset that one task processes on one executor core. The number of partitions directly controls the degree of parallelism: too few and you waste cluster resources; too many and task scheduling overhead dominates.

How Spark Determines Partition Count

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("Partitions").getOrCreate()
sc = spark.sparkContext

# Default: sc.defaultParallelism (usually 2 × num executor cores)
print(sc.defaultParallelism)   # e.g., 8 on 4 cores

# From file: driven by HDFS block size (default 128 MB)
df = spark.read.parquet("s3://bucket/large-table/")
print(df.rdd.getNumPartitions())   # 1 partition per 128 MB file chunk

# After shuffle (groupBy, join, distinct):
# spark.sql.shuffle.partitions (default: 200)
spark.conf.get("spark.sql.shuffle.partitions")   # "200"

# Adaptive Query Execution (Spark 3.x) auto-coalesces shuffle partitions
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")

Checking Partition Count

# RDD
rdd = sc.parallelize(range(100), numSlices=10)
print(rdd.getNumPartitions())   # 10

# DataFrame
df = spark.read.parquet("employees.parquet")
print(df.rdd.getNumPartitions())   # Depends on file size

# Partition sizes (approximate)
df.rdd.mapPartitionsWithIndex(
    lambda i, it: [(i, sum(1 for _ in it))]
).collect()
# [(0, 1234), (1, 1198), (2, 1301), ...]  — rows per partition

repartition — Increase or Change Partitions

repartition causes a full shuffle — all data moves across the network. Use it to increase parallelism or distribute data evenly.

# Repartition to a fixed count
df.repartition(200)

# Repartition by column (data with same key goes to the same partition)
df.repartition(200, F.col("department"))

# Multiple partition columns
df.repartition(400, F.col("country"), F.col("year"))

# RDD repartition
rdd.repartition(50)

# When to use repartition:
# - Before a costly operation like a join on a skewed dataset
# - When writing output with an optimal number of files
# - When increasing parallelism after reading small files

coalesce — Decrease Partitions (No Shuffle)

coalesce reduces the partition count by merging adjacent partitions without moving data across the network. It’s faster than repartition for reducing partitions.

# Reduce to fewer partitions (no shuffle)
df.coalesce(4)

# Write a single output file
df.coalesce(1).write.parquet("output/")   # Single file — useful for small outputs

# RDD coalesce
rdd.coalesce(2)

# When to use coalesce:
# - After filtering large amounts of data (many empty partitions remain)
# - Before writing to reduce the number of output files
# - When you have far more partitions than executor cores

Partition Sizing Guidelines

# Target: each partition should be ~128 MB when read from storage
# Target: 2-4 tasks per executor core for good CPU utilization

# Example: 10 GB dataset, 8 cores, target 128 MB partitions
# → 10 GB / 128 MB ≈ 80 partitions
# → 8 cores × 3 tasks each = 24 recommended concurrent tasks
# → Use 80 partitions (divisible by cores is fine, not required)

# Set for your entire session
spark.conf.set("spark.sql.shuffle.partitions", "80")

# Or set per-operation by repartitioning before groupBy
df.repartition(80).groupBy("category").sum("revenue")

Handling Data Skew

Skew means some partitions are much larger than others — a few executors run for hours while the rest finish in minutes.

# Detect skew
df.rdd.mapPartitionsWithIndex(
    lambda i, it: [(i, sum(1 for _ in it))]
).toDF(["partition", "row_count"]).orderBy(F.col("row_count").desc()).show(10)

# Fix with salting — for groupBy/join skew
import random

# Add a random salt to distribute skewed keys
df_salted = df.withColumn(
    "salted_key",
    F.concat(F.col("skewed_key"), F.lit("_"), (F.rand() * 10).cast("int").cast("string"))
)

# Aggregate in two steps
intermediate = df_salted.groupBy("salted_key").sum("value")
# Remove salt prefix and re-aggregate
final = intermediate.withColumn("original_key", F.split(F.col("salted_key"), "_")[0]) \
    .groupBy("original_key").sum("sum(value)")

# Fix with broadcast join — for join skew (small table on one side)
from pyspark.sql.functions import broadcast
df_large.join(broadcast(df_small), "key")

# AQE automatic skew join fix (Spark 3.x)
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")

Custom Partitioner (RDD API)

from pyspark import Partitioner

class DepartmentPartitioner(Partitioner):
    def __init__(self, partitions):
        self.partitions_count = partitions

    def numPartitions(self):
        return self.partitions_count

    def getPartition(self, key):
        mapping = {"Engineering": 0, "Marketing": 1, "HR": 2}
        return mapping.get(key, self.partitions_count - 1)

pairs = sc.parallelize([("Engineering", 1), ("Marketing", 2), ("HR", 3)])
partitioned = pairs.partitionBy(3, DepartmentPartitioner(3))