Technology  /  Apache Spark

Apache Spark 49 guides · updated 2026

Distributed data processing at scale — RDDs, DataFrames, Structured Streaming, and the tuning techniques that keep Spark jobs fast and cheap.

Saving a PySpark DataFrame as Parquet

Parquet is the recommended format for Spark output in 2025. It’s columnar (enables predicate and projection pushdown), self-describing (schema is embedded), and compressed by default. Most lakehouse architectures build on Parquet or Parquet-based table formats like Delta Lake.


Basic Parquet Write

from pyspark.sql import SparkSession, functions as F
spark = SparkSession.builder.appName("WriteParquet").getOrCreate()
data = [("Alice", "Engineering", 95000), ("Bob", "Marketing", 72000)]
df = spark.createDataFrame(data, ["name", "department", "salary"])
# Write Parquet
df.write.mode("overwrite").parquet("output/employees/")

Write Modes

df.write.mode("overwrite").parquet("output/") # Replace all existing data
df.write.mode("append").parquet("output/") # Add to existing data
df.write.mode("ignore").parquet("output/") # No-op if path exists
df.write.mode("error").parquet("output/") # Default — error if path exists

Compression Codecs

# Snappy — default, fast decompression, moderate compression
df.write.option("compression", "snappy").parquet("output/")
# Gzip — better compression ratio, slower decompression
df.write.option("compression", "gzip").parquet("output/")
# Zstandard — best balance of ratio and speed (recommended 2025)
df.write.option("compression", "zstd").parquet("output/")
# No compression — for fastest sequential reads
df.write.option("compression", "none").parquet("output/")

Partitioning

Partitioning by commonly-filtered columns dramatically improves query performance by enabling partition pruning (Spark skips irrelevant partitions):

# Partition by year and region
df.write \
.mode("overwrite") \
.partitionBy("year", "region") \
.parquet("s3://bucket/sales/")
# Output structure:
# s3://bucket/sales/year=2024/region=APAC/part-00000.snappy.parquet
# s3://bucket/sales/year=2025/region=EMEA/part-00000.snappy.parquet
# Query with partition pruning — Spark only reads relevant directories
spark.read \
.parquet("s3://bucket/sales/") \
.filter("year = 2025 AND region = 'APAC'") \
.show()

Controlling Output File Size

# Repartition before writing for even output file sizes
df.repartition(50).write.mode("overwrite").parquet("output/")
# Single file output (for small datasets)
df.coalesce(1).write.mode("overwrite").parquet("output/single/")
# Target file size with AQE (Spark 3.x automatic optimization)
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.files.maxPartitionBytes", "134217728") # 128 MB

Delta Lake builds on Parquet and adds ACID transactions, schema enforcement, and time travel:

# Write Delta table
df.write.format("delta").mode("overwrite").save("s3://bucket/delta/employees/")
# Append
df.write.format("delta").mode("append").save("s3://bucket/delta/employees/")
# Read back
delta_df = spark.read.format("delta").load("s3://bucket/delta/employees/")
# Time travel — read historical snapshot
spark.read \
.format("delta") \
.option("versionAsOf", "3") \
.load("s3://bucket/delta/employees/") \
.show()
# Upsert (merge)
from delta.tables import DeltaTable
target = DeltaTable.forPath(spark, "s3://bucket/delta/employees/")
target.alias("t").merge(
df.alias("s"),
"t.employee_id = s.employee_id"
).whenMatchedUpdateAll().whenNotMatchedInsertAll().execute()

Schema Evolution

# Allow adding new columns without rewriting the entire table
df_with_new_col.write \
.format("delta") \
.option("mergeSchema", "true") \
.mode("append") \
.save("s3://bucket/delta/employees/")