Apache Spark 49 guides · updated 2026

Distributed data processing at scale — RDDs, DataFrames, Structured Streaming, and the tuning techniques that keep Spark jobs fast and cheap.

SparkContext

SparkContext (sc) was the original entry point into a Spark cluster, introduced in Spark 1.x for RDD-based programming. Since Spark 2.0, SparkSession has replaced it as the recommended entry point, but a SparkContext still exists underneath every SparkSession and remains the gateway to low-level RDD operations.


SparkContext in the Modern API

from pyspark.sql import SparkSession
# Create SparkSession (the modern way)
spark = SparkSession.builder \
    .appName("MyApp") \
    .getOrCreate()
# SparkContext is embedded inside SparkSession
sc = spark.sparkContext
# Verify
print(type(sc)) # <class 'pyspark.context.SparkContext'>
print(sc.master) # "local[*]" or "yarn" etc.
print(sc.appName) # "MyApp"
print(sc.version) # e.g. "3.5.1"

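Only one SparkContext can be active per JVM at a time. In PySpark, calling the SparkContext constructor while another context is still running raises a ValueError; either stop the existing context first or call SparkContext.getOrCreate(), which returns the active one.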


Creating RDDs with SparkContext

# parallelize: from a Python collection
rdd1 = sc.parallelize([1, 2, 3, 4, 5])
rdd2 = sc.parallelize(range(1000), numSlices=8) # 8 partitions
# textFile: read text files line by line (glob patterns are allowed)
logs = sc.textFile("s3://bucket/logs/2025-06-*.log")
# hdfs:///path uses the default NameNode; hdfs://data/... would treat "data" as the host
input_lines = sc.textFile("hdfs:///data/input.txt", minPartitions=4)
# wholeTextFiles: returns (filename, content) pairs
files = sc.wholeTextFiles("s3://bucket/configs/")
# sequenceFile: Hadoop sequence files
seq_rdd = sc.sequenceFile("hdfs:///data/sequence/")
# binaryFiles: binary files as (path, bytes) pairs
binary = sc.binaryFiles("s3://bucket/images/")
# emptyRDD: an RDD with no partitions and no elements
empty = sc.emptyRDD()
# union of multiple RDDs
a = sc.parallelize([1, 2, 3])
b = sc.parallelize([4, 5, 6])
united = sc.union([a, b])

SparkContext Configuration

from pyspark import SparkContext, SparkConf
# Building a standalone SparkContext (old style; use SparkSession instead)
conf = SparkConf() \
    .setAppName("LegacyApp") \
    .setMaster("local[*]") \
    .set("spark.executor.memory", "4g") \
    .set("spark.rdd.compress", "true")
# Raises ValueError if another SparkContext is already active in this process;
# stop that one first, or use SparkContext.getOrCreate(conf) to reuse it
sc = SparkContext(conf=conf)
# Reading config
sc.getConf().get("spark.executor.memory") # "4g"
sc.getConf().getAll() # All key-value pairs as a list of (key, value) tuples

Parallelism Control

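# Default parallelism: in local[*] mode, the number of logical cores on the machine;
# on a cluster, the total number of executor cores (at least 2).
# Setting spark.default.parallelism overrides it.
print(sc.defaultParallelism) # e.g., 8 on an 8-core machine in local[*] mode
# Set the partition count explicitly
rdd = sc.parallelize(range(10000), numSlices=100)
print(rdd.getNumPartitions()) # 100
# Default minimum partitions for textFile and similar readers: min(defaultParallelism, 2)
print(sc.defaultMinPartitions)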
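To make the partition counts above more concrete, here is a plain-Python sketch of an even split of a collection into numSlices contiguous ranges, plus the min(defaultParallelism, 2) rule behind defaultMinPartitions. The helper names slice_bounds and default_min_partitions are hypothetical, and the slicing is an approximation for intuition: Spark's actual element placement can differ in detail, especially for non-range Python collections.

```python
def slice_bounds(length, num_slices):
    """Return (start, end) index pairs that split `length` items into
    `num_slices` contiguous, nearly equal ranges (end is exclusive)."""
    if num_slices < 1:
        raise ValueError("num_slices must be at least 1")
    return [
        (i * length // num_slices, (i + 1) * length // num_slices)
        for i in range(num_slices)
    ]


def default_min_partitions(default_parallelism):
    """Mirror of the min(defaultParallelism, 2) rule used for file reads."""
    return min(default_parallelism, 2)


bounds = slice_bounds(10, 3)
sizes = [end - start for start, end in bounds]
print(bounds)  # [(0, 3), (3, 6), (6, 10)]
print(sizes)   # [3, 3, 4]
print(default_min_partitions(8))  # 2
```

The point of the sketch is that numSlices controls how many tasks a stage gets, not how large the data is: 10,000 elements in 100 slices means 100 small tasks of about 100 elements each.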

Broadcast and Accumulator from SparkContext

# Broadcast a read-only variable to all executors
lookup_table = {"apple": 1, "banana": 2, "cherry": 3}
broadcast_table = sc.broadcast(lookup_table)
result = sc.parallelize(["apple", "banana", "cherry"]) \
    .map(lambda x: (x, broadcast_table.value.get(x, 0))) \
    .collect() # Run the job while the broadcast is still available
# [("apple", 1), ("banana", 2), ("cherry", 3)]
# Free the broadcast variable once no pending job needs it
broadcast_table.unpersist() # Drops cached copies on executors; re-sent if used again
broadcast_table.destroy() # Releases everything; the variable cannot be used afterwards
# Accumulator: tasks can only add to it; only the driver can read its value
error_count = sc.accumulator(0)
def process(record):
    if "ERROR" in record:
        error_count.add(1)
sc.textFile("logs.txt").foreach(process)
print(f"Errors found: {error_count.value}")

Spark UI Access

# SparkContext exposes the Spark UI URL
print(sc.uiWebUrl) # "http://localhost:4040"
# Also accessible via SparkSession
print(spark.sparkContext.uiWebUrl)

Stopping SparkContext

# Stop the SparkContext (and the underlying Spark application)
sc.stop()
# Equivalently via SparkSession:
spark.stop() # This also stops the SparkContext
# PySpark has no public isStopped property (it exists on the Scala/Java SparkContext).
# After stop(), any further job submitted through sc fails.

SparkContext vs SparkSession in 2025: use SparkSession for all new code. Access sc through it only when you need raw RDD APIs.