SparkSession

SparkSession is the unified entry point for all Spark functionality introduced in Spark 2.0. It consolidates the older SparkContext, SQLContext, and HiveContext into a single object that manages your connection to the Spark cluster, data reading, SQL execution, and configuration.

Creating a SparkSession

from pyspark.sql import SparkSession

# Minimal — good for development
spark = SparkSession.builder \
    .appName("MyPipeline") \
    .getOrCreate()

# Production configuration
spark = SparkSession.builder \
    .appName("DataPipeline") \
    .master("yarn") \
    .config("spark.executor.memory",          "8g") \
    .config("spark.executor.cores",           "4") \
    .config("spark.sql.shuffle.partitions",   "200") \
    .config("spark.sql.adaptive.enabled",     "true") \
    .config("spark.serializer",               "org.apache.spark.serializer.KryoSerializer") \
    .enableHiveSupport() \
    .getOrCreate()

print(spark.version)   # "3.5.1"

getOrCreate() returns an existing session if one already exists in the JVM — safe to call multiple times across modules and notebook cells.

Reading Data

# CSV
df = spark.read \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .csv("s3://bucket/sales.csv")

# Parquet — schema embedded in file, no options needed
df = spark.read.parquet("s3://bucket/transactions/")

# JSON (multi-line format)
df = spark.read.option("multiLine", "true").json("events/*.json")

# Delta Lake (recommended for 2025 lakehouses)
df = spark.read.format("delta").load("s3://bucket/delta/employees/")

# Database via JDBC
df = spark.read \
    .format("jdbc") \
    .option("url", "jdbc:postgresql://host:5432/db") \
    .option("dbtable", "public.orders") \
    .option("user", "spark") \
    .option("password", "secret") \
    .option("numPartitions", "16") \
    .load()

SQL Queries

# Register DataFrame as a SQL view
df.createOrReplaceTempView("sales")

result = spark.sql("""
    SELECT
        region,
        SUM(revenue) AS total_revenue,
        COUNT(DISTINCT customer_id) AS unique_customers
    FROM sales
    WHERE year = 2025
    GROUP BY region
    ORDER BY total_revenue DESC
""")
result.show()

# Global view — accessible from other sessions
df.createOrReplaceGlobalTempView("global_sales")
spark.sql("SELECT COUNT(*) FROM global_temp.global_sales").show()

Runtime Configuration

# Get and set runtime config
spark.conf.get("spark.sql.shuffle.partitions")        # "200"
spark.conf.set("spark.sql.shuffle.partitions", "400")

# Enable Adaptive Query Execution (auto-tunes shuffle partitions)
spark.conf.set("spark.sql.adaptive.enabled",                     "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled",  "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled",            "true")

Accessing SparkContext

sc = spark.sparkContext
rdd = sc.parallelize([1, 2, 3, 4, 5])
sc.getConf().getAll()   # All Spark settings

Stopping the Session

try:
    result = spark.read.parquet("data/").count()
    print(f"Rows: {result}")
finally:
    spark.stop()   # Always stop in scripts; notebooks usually leave it running