
Apache Spark 49 guides · updated 2026

Distributed data processing at scale — RDDs, DataFrames, Structured Streaming, and the tuning techniques that keep Spark jobs fast and cheap.

SparkSession

SparkSession, introduced in Spark 2.0, is the unified entry point for Spark functionality. It consolidates the older SparkContext, SQLContext, and HiveContext into a single object that manages the connection to the Spark cluster, data reading, SQL execution, and configuration.


Creating a SparkSession

from pyspark.sql import SparkSession

# Minimal: good for development
spark = SparkSession.builder \
    .appName("MyPipeline") \
    .getOrCreate()

# Production configuration (use this instead of the minimal builder, not after it:
# executor settings cannot change once a session is already running)
spark = SparkSession.builder \
    .appName("DataPipeline") \
    .master("yarn") \
    .config("spark.executor.memory", "8g") \
    .config("spark.executor.cores", "4") \
    .config("spark.sql.shuffle.partitions", "200") \
    .config("spark.sql.adaptive.enabled", "true") \
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer") \
    .enableHiveSupport() \
    .getOrCreate()

print(spark.version)  # e.g. "3.5.1"

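getOrCreate() returns the existing session if one already exists in the JVM, so it is safe to call multiple times across modules and notebook cells. When a session already exists, only runtime-modifiable options from the builder are applied to it; static settings such as spark.executor.memory are ignored.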


Reading Data

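# CSV (inferSchema costs an extra pass over the file)
df = spark.read \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .csv("s3://bucket/sales.csv")

# Parquet: schema embedded in file, no options needed
df = spark.read.parquet("s3://bucket/transactions/")

# JSON (multiLine handles records that span several lines)
df = spark.read.option("multiLine", "true").json("events/*.json")

# Delta Lake (recommended for 2025 lakehouses; needs the Delta Lake package installed)
df = spark.read.format("delta").load("s3://bucket/delta/employees/")

# Database via JDBC (the JDBC driver JAR must be available to Spark)
# numPartitions only caps concurrent connections here; the read is split across
# partitions only when partitionColumn, lowerBound, and upperBound are also set.
# Load real credentials from a secret store rather than hardcoding them.
df = spark.read \
    .format("jdbc") \
    .option("url", "jdbc:postgresql://host:5432/db") \
    .option("dbtable", "public.orders") \
    .option("user", "spark") \
    .option("password", "secret") \
    .option("numPartitions", "16") \
    .load()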

SQL Queries

# Register DataFrame as a SQL view (visible only in this session)
df.createOrReplaceTempView("sales")

result = spark.sql("""
    SELECT
        region,
        SUM(revenue) AS total_revenue,
        COUNT(DISTINCT customer_id) AS unique_customers
    FROM sales
    WHERE year = 2025
    GROUP BY region
    ORDER BY total_revenue DESC
""")
result.show()

# Global view: accessible from other sessions in the same Spark application,
# always through the global_temp database
df.createOrReplaceGlobalTempView("global_sales")
spark.sql("SELECT COUNT(*) FROM global_temp.global_sales").show()

Runtime Configuration

# Get and set runtime config
spark.conf.get("spark.sql.shuffle.partitions")  # "200" (the default)
spark.conf.set("spark.sql.shuffle.partitions", "400")

# Adaptive Query Execution (auto-tunes shuffle partitions); on by default since Spark 3.2
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")

Accessing SparkContext

# The underlying SparkContext is still available for RDD work
sc = spark.sparkContext
rdd = sc.parallelize([1, 2, 3, 4, 5])
sc.getConf().getAll()  # All Spark settings as (key, value) pairs

Stopping the Session

try:
    result = spark.read.parquet("data/").count()
    print(f"Rows: {result}")
finally:
    spark.stop()  # Always stop in scripts; notebooks usually leave it running