Saving a PySpark DataFrame as CSV

DataFrame.write.csv() writes a DataFrame to a directory of CSV files, one part file per partition by default. Use write options to control headers, delimiters, compression, and the number of output files.

Basic CSV Write

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("WriteCSV").getOrCreate()

data = [("Alice", "Engineering", 95000), ("Bob", "Marketing", 72000)]
df = spark.createDataFrame(data, ["name", "department", "salary"])

# Write CSV with header — creates a directory with multiple part files
df.write \
    .mode("overwrite") \
    .option("header", "true") \
    .csv("output/employees/")

Output directory structure:

output/employees/
├── _SUCCESS
├── part-00000-xxxx.csv
└── part-00001-xxxx.csv

Write Modes

# overwrite — replace existing data
df.write.mode("overwrite").csv("output/")

# append — add to existing data
df.write.mode("append").csv("output/")

# ignore — no-op if directory exists
df.write.mode("ignore").csv("output/")

# error / errorIfExists (default) — raises error if directory exists
df.write.mode("error").csv("output/")

CSV Write Options

# Wrap the chain in parentheses so each option can carry an inline comment
# (a comment after a trailing backslash is a syntax error)
(
    df.write
    .mode("overwrite")
    .option("header", "true")                          # Write column names in first row
    .option("sep", "\t")                               # Tab-delimited (TSV)
    .option("quote", '"')                              # Quote character
    .option("escape", "\\")                            # Escape character
    .option("nullValue", "NULL")                       # String written for null values
    .option("dateFormat", "yyyy-MM-dd")                # Date formatting
    .option("timestampFormat", "yyyy-MM-dd HH:mm:ss")  # Timestamp formatting
    .option("encoding", "UTF-8")                       # Output character encoding
    .csv("output/report.csv")                          # Creates a directory named report.csv
)
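
When the same options are reused across several writes, they can also be collected in a dictionary and passed through DataFrameWriter.options(). The following is a minimal sketch; the helper names tsv_write_options and write_tsv are hypothetical, and the option values simply mirror the ones listed above.

```python
def tsv_write_options(null_value: str = "NULL") -> dict:
    """Return a reusable set of CSV writer options for tab-delimited output."""
    return {
        "header": "true",                          # column names in the first row
        "sep": "\t",                               # tab delimiter
        "nullValue": null_value,                   # string written for nulls
        "dateFormat": "yyyy-MM-dd",
        "timestampFormat": "yyyy-MM-dd HH:mm:ss",
    }


def write_tsv(df, path: str) -> None:
    """Write df (a PySpark DataFrame) to path using the shared options."""
    df.write.mode("overwrite").options(**tsv_write_options()).csv(path)
```

Calling write_tsv(df, "output/report/") would then apply every option in one step.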

Compression

# Supported codecs include gzip, bzip2, lz4, snappy, and deflate
df.write \
    .mode("overwrite") \
    .option("header",      "true") \
    .option("compression", "gzip") \
    .csv("output/compressed/")

# Output: part-00000-xxxx.csv.gz

Writing a Single File

By default, Spark writes one file per partition. For a single output file:

# coalesce(1): merges all partitions into one, so a single task writes a single part file
df.coalesce(1) \
    .write \
    .mode("overwrite") \
    .option("header", "true") \
    .csv("output/single-file/")

# ⚠️ For large datasets, avoid coalesce(1): all data flows through one task on one executor,
# which removes write parallelism. Use it only when downstream systems require a single file
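
Even with coalesce(1), the output path is still a directory that holds one part file plus marker files such as _SUCCESS. If a downstream system expects a specific file name, one option on a local filesystem is to move the part file after the write finishes. This is a minimal sketch, assuming local paths (not S3 or HDFS); the helper name promote_single_part is hypothetical.

```python
import shutil
from pathlib import Path


def promote_single_part(output_dir: str, target_file: str) -> Path:
    """Move the only part-*.csv file in output_dir to target_file.

    Raises ValueError unless the directory holds exactly one part file.
    """
    parts = sorted(Path(output_dir).glob("part-*.csv"))
    if len(parts) != 1:
        raise ValueError(f"expected exactly 1 part file, found {len(parts)}")
    target = Path(target_file)
    target.parent.mkdir(parents=True, exist_ok=True)  # create parent folders if missing
    shutil.move(str(parts[0]), str(target))
    return target
```

For example, promote_single_part("output/single-file/", "output/employees.csv") could run right after the write above.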

Writing to S3

# S3 write (AWS credentials must be configured)
# The s3:// scheme works on Amazon EMR; with open-source Spark and hadoop-aws, use s3a:// instead
df.write \
    .mode("overwrite") \
    .option("header", "true") \
    .option("compression", "snappy") \
    .csv("s3://my-bucket/data/employees/")

Partitioned CSV Output

# Partition by year and month for efficient downstream reads
# Assumes df has a date column named hire_date (the sample DataFrame above does not)
df.withColumn("year",  F.year(F.col("hire_date"))) \
  .withColumn("month", F.month(F.col("hire_date"))) \
  .write \
  .mode("overwrite") \
  .option("header", "true") \
  .partitionBy("year", "month") \
  .csv("output/partitioned/")

# Output:
# output/partitioned/year=2024/month=1/part-00000.csv
# output/partitioned/year=2025/month=3/part-00000.csv
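
The year=.../month=... directory names are Hive-style partition folders: the partition columns are stored in the path rather than in the CSV rows, and Spark rebuilds them when it reads the root directory. Tools outside Spark have to recover them from the path themselves. The sketch below shows one way to do that in plain Python; the helper name partition_values is hypothetical, and the values come back as strings.

```python
from pathlib import PurePosixPath


def partition_values(path: str) -> dict:
    """Return {column: value} for every key=value directory segment in path."""
    values = {}
    for segment in PurePosixPath(path).parts:
        key, sep, value = segment.partition("=")
        if sep and key:  # keep only segments shaped like key=value
            values[key] = value
    return values


print(partition_values("output/partitioned/year=2024/month=1/part-00000.csv"))
# {'year': '2024', 'month': '1'}
```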

Reading Back the Output

df_back = spark.read \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .csv("output/employees/")
df_back.show()
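
inferSchema makes Spark scan the data an extra time to guess column types. When the types are already known, passing an explicit schema avoids that pass and keeps the types stable between runs. DataFrameReader.schema() accepts a DDL string, so a minimal sketch could look like this. The column types are assumptions based on the sample data, and the names ddl_schema and read_employees are hypothetical.

```python
# Column names come from the example DataFrame; the types are assumptions.
EMPLOYEE_COLUMNS = {"name": "STRING", "department": "STRING", "salary": "INT"}


def ddl_schema(columns: dict) -> str:
    """Build a DDL schema string such as 'name STRING, salary INT'."""
    return ", ".join(f"{name} {dtype}" for name, dtype in columns.items())


def read_employees(spark, path: str):
    """Read the CSV directory with an explicit schema instead of inferSchema."""
    return (
        spark.read
        .option("header", "true")
        .schema(ddl_schema(EMPLOYEE_COLUMNS))
        .csv(path)
    )
```

With an active session, read_employees(spark, "output/employees/").printSchema() would show the declared types.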