SparkContext
SparkContext (sc) is the original entry point into a Spark cluster and the basis of RDD-based programming in Spark 1.x. Since Spark 2.0, SparkSession has replaced it as the recommended entry point, but a SparkContext still exists underneath every SparkSession and remains the gateway to low-level RDD operations.
SparkContext in the Modern API
from pyspark.sql import SparkSession
# Create SparkSession (the modern way)
spark = SparkSession.builder \
    .appName("MyApp") \
    .getOrCreate()
# SparkContext is embedded inside SparkSession
sc = spark.sparkContext
# Verify
print(type(sc))    # <class 'pyspark.context.SparkContext'>
print(sc.master)   # "local[*]" or "yarn" etc.
print(sc.appName)  # "MyApp"
print(sc.version)  # "3.5.1"
In Spark 2.0+, there should be only one active SparkContext per JVM. Creating a second one while the first is still running raises an error (a ValueError in PySpark).
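The one-context rule can be illustrated without a cluster. The following is a toy model in plain Python, not PySpark's implementation: the class `ToyContext` and its attributes are hypothetical, and it only mimics the behavior described above, where direct construction fails while a context is active and a get-or-create call hands back the existing instance.

```python
import threading


class ToyContext:
    """Hypothetical stand-in for SparkContext; models only the single-instance guard."""

    _active = None            # the one live context, if any
    _lock = threading.Lock()  # protects _active across threads

    def __init__(self, app_name):
        with ToyContext._lock:
            if ToyContext._active is not None:
                # PySpark raises ValueError in the equivalent situation
                raise ValueError("Cannot run multiple contexts at once")
            ToyContext._active = self
        self.app_name = app_name

    @classmethod
    def get_or_create(cls, app_name):
        """Return the active context, creating one only if none exists."""
        with cls._lock:
            existing = cls._active
        if existing is not None:
            return existing
        return cls(app_name)

    def stop(self):
        """Release the slot so a new context can be created."""
        with ToyContext._lock:
            if ToyContext._active is self:
                ToyContext._active = None


first = ToyContext.get_or_create("MyApp")
second = ToyContext.get_or_create("OtherApp")  # reuses the existing context
print(first is second)   # True
print(second.app_name)   # MyApp

try:
    ToyContext("Third")  # direct construction while one is active
except ValueError as exc:
    print(f"rejected: {exc}")

first.stop()
replacement = ToyContext("Fresh")  # allowed once the old context is stopped
print(replacement.app_name)        # Fresh
```

In real PySpark code, `SparkSession.builder.getOrCreate()` and `SparkContext.getOrCreate()` play the role that `get_or_create` plays in this sketch.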
Creating RDDs with SparkContext
# parallelize: from a Python collection
rdd1 = sc.parallelize([1, 2, 3, 4, 5])
rdd2 = sc.parallelize(range(1000), numSlices=8)  # 8 partitions
# textFile: read text files line by line
logs = sc.textFile("s3://bucket/logs/2025-06-*.log")
logs = sc.textFile("hdfs://data/input.txt", minPartitions=4)
# wholeTextFiles: returns (filename, content) pairs
files = sc.wholeTextFiles("s3://bucket/configs/")
# sequenceFile: Hadoop sequence files
seq_rdd = sc.sequenceFile("hdfs://data/sequence/")
# binaryFiles: binary files as (path, bytes) pairs
binary = sc.binaryFiles("s3://bucket/images/")
# emptyRDD
empty = sc.emptyRDD()
# union of multiple RDDs
a = sc.parallelize([1, 2, 3])
b = sc.parallelize([4, 5, 6])
united = sc.union([a, b])
SparkContext Configuration
from pyspark import SparkContext, SparkConf
# Building a standalone SparkContext (old style; use SparkSession instead)
conf = SparkConf() \
    .setAppName("LegacyApp") \
    .setMaster("local[*]") \
    .set("spark.executor.memory", "4g") \
    .set("spark.rdd.compress", "true")
sc = SparkContext(conf=conf)
# Reading config
sc.getConf().get("spark.executor.memory")  # "4g"
sc.getConf().getAll()                      # All key-value pairs
Parallelism Control
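# Default parallelism: in local mode, the number of cores given to the master
# (local[*] uses all logical cores); on a cluster, the total number of executor
# cores or 2, whichever is larger, unless spark.default.parallelism is set
print(sc.defaultParallelism)  # e.g., 4 with local[4]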
# Set parallelism explicitly
rdd = sc.parallelize(range(10000), numSlices=100)
print(rdd.getNumPartitions())  # 100
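To see how `numSlices` shapes the resulting partitions, the sketch below splits a collection with the even-split rule used by Spark's Scala `ParallelCollectionRDD`, where partition i covers indices from i*n/numSlices up to (i+1)*n/numSlices. It is a plain-Python approximation: `slice_collection` is a hypothetical helper, and PySpark's own code path adds batching and serialization on top of this.

```python
def slice_collection(items, num_slices):
    """Split items into num_slices contiguous chunks of near-equal size.

    Hypothetical helper: chunk i covers indices
    [i * n // num_slices, (i + 1) * n // num_slices).
    """
    if num_slices < 1:
        raise ValueError("num_slices must be a positive integer")
    items = list(items)
    n = len(items)
    return [
        items[i * n // num_slices:(i + 1) * n // num_slices]
        for i in range(num_slices)
    ]


partitions = slice_collection([1, 2, 3, 4, 5], 2)
print(partitions)  # [[1, 2], [3, 4, 5]]

sizes = [len(p) for p in slice_collection(range(10000), 100)]
print(len(sizes), set(sizes))  # 100 {100}

# More slices than elements leaves some partitions empty
sparse = slice_collection([1, 2, 3], 5)
print(sparse)  # [[], [1], [], [2], [3]]
```

On a real RDD, `rdd.glom().collect()` returns the actual contents of each partition, which is a quick way to check how the data was distributed.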
# Check partitions
sc.defaultMinPartitions  # Minimum partitions for text files
Broadcast and Accumulator from SparkContext
# Broadcast a read-only variable to all executors
lookup_table = {"apple": 1, "banana": 2, "cherry": 3}
broadcast_table = sc.broadcast(lookup_table)
result = sc.parallelize(["apple", "banana", "cherry"]) \
    .map(lambda x: (x, broadcast_table.value.get(x, 0)))
# result.collect() returns [("apple", 1), ("banana", 2), ("cherry", 3)]
# Free the broadcast variable
broadcast_table.unpersist()  # Drop cached copies on executors; re-sent if used again
broadcast_table.destroy()    # Release all resources; the variable cannot be used afterwards
# Accumulator: a counter that executors can only add to; only the driver reads its value
error_count = sc.accumulator(0)
def process(record):
    global error_count
    if "ERROR" in record:
        error_count += 1
    return record
sc.textFile("logs.txt").foreach(process)
print(f"Errors found: {error_count.value}")
Spark UI Access
# SparkContext exposes the Spark UI URL
print(sc.uiWebUrl)  # "http://localhost:4040"
# Also accessible via SparkSession
print(spark.sparkContext.uiWebUrl)
Stopping SparkContext
# Stop the SparkContext (and the underlying Spark application)
sc.stop()
# Equivalently via SparkSession:
spark.stop()  # This also stops the SparkContext
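# Check if stopped. isStopped belongs to the Scala/Java SparkContext; PySpark does
# not expose it as a public attribute. After stop(), PySpark clears its JVM handle
sc._jsc is None  # True after stop() (internal attribute, may change between versions)
SparkContext vs SparkSession in 2025: Use SparkSession for all new code. Access sc through it only when you need raw RDD APIs.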