Saving a PySpark DataFrame as CSV
DataFrame.write.csv() writes a DataFrame to a directory of CSV files, one part file per partition by default. Use write options to control headers, delimiters, compression, and the number of output files.
Basic CSV Write
from pyspark.sql import SparkSession, functions as F
spark = SparkSession.builder.appName("WriteCSV").getOrCreate()
data = [("Alice", "Engineering", 95000), ("Bob", "Marketing", 72000)]
df = spark.createDataFrame(data, ["name", "department", "salary"])
# Write CSV with header: creates a directory with multiple part files
df.write \
    .mode("overwrite") \
    .option("header", "true") \
    .csv("output/employees/")

Output directory structure:
output/employees/
├── _SUCCESS
├── part-00000-xxxx.csv
└── part-00001-xxxx.csv

Write Modes
# overwrite: replace existing data
df.write.mode("overwrite").csv("output/")
# append: add to existing data
df.write.mode("append").csv("output/")
# ignore: no-op if directory exists
df.write.mode("ignore").csv("output/")
# error / errorIfExists (default): raises an error if directory exists
df.write.mode("error").csv("output/")

CSV Write Options
# Parentheses allow inline comments; a comment after a trailing backslash is a syntax error
(
    df.write
    .mode("overwrite")
    .option("header", "true")            # Write column names in first row
    .option("sep", "\t")                 # Tab-delimited (TSV)
    .option("quote", '"')                # Quote character
    .option("escape", "\\")              # Escape character
    .option("nullValue", "NULL")         # String for null values
    .option("dateFormat", "yyyy-MM-dd")  # Date formatting
    .option("timestampFormat", "yyyy-MM-dd HH:mm:ss")  # Timestamp formatting
    .option("encoding", "UTF-8")         # Output character encoding
    .csv("output/report.csv")            # Still a directory, despite the .csv suffix
)

Compression
# Parentheses allow the inline comment; a comment after a trailing backslash is a syntax error
(
    df.write
    .mode("overwrite")
    .option("header", "true")
    .option("compression", "gzip")  # gzip, bzip2, lz4, snappy, deflate
    .csv("output/compressed/")
)
# Output: part-00000-xxxx.csv.gz

Writing a Single File
By default, Spark writes one file per partition. For a single output file:
# coalesce(1): collapses the data into one partition, so a single task writes one part file
df.coalesce(1) \
    .write \
    .mode("overwrite") \
    .option("header", "true") \
    .csv("output/single-file/")
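Even with one partition, Spark still picks the file name (part-00000-xxxx.csv) inside the output directory. If a downstream system expects a fixed file name, one possible approach is to copy the part file out afterwards. The helper below is a minimal sketch for local filesystems and uncompressed output only; export_single_csv and the target path are hypothetical names introduced for illustration, and object stores such as S3 would need their own copy calls.

```python
import glob
import os
import shutil


def export_single_csv(spark_output_dir: str, target_path: str) -> str:
    """Copy the lone part file from a Spark CSV output directory to target_path.

    Hypothetical helper: assumes a local filesystem and uncompressed output.
    """
    # Matches part-00000-xxxx.csv; skips _SUCCESS and hidden .crc checksum files.
    part_files = glob.glob(os.path.join(spark_output_dir, "part-*.csv"))
    if len(part_files) != 1:
        raise ValueError(
            f"expected exactly one part file, found {len(part_files)}"
        )
    shutil.copyfile(part_files[0], target_path)
    return target_path
```

After the coalesce(1) write, a call such as export_single_csv("output/single-file/", "output/employees.csv") would leave a plainly named copy next to the Spark output directory.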
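⚠️ For large datasets, avoid coalesce(1): all rows are funneled through a single task on one executor, which removes write parallelism and makes that task the bottleneck. Use it only when downstream systems require a single file.

Writing to S3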
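# S3 write (AWS credentials must be configured)
# Note: s3:// is the scheme used on Amazon EMR; open-source Spark with hadoop-aws typically uses s3a://
df.write \
    .mode("overwrite") \
    .option("header", "true") \
    .option("compression", "snappy") \
    .csv("s3://my-bucket/data/employees/")

Partitioned CSV Output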
# Partition by year and month for efficient downstream reads
# Assumes df has a hire_date column (the sample DataFrame above does not)
df.withColumn("year", F.year(F.col("hire_date"))) \
    .withColumn("month", F.month(F.col("hire_date"))) \
    .write \
    .mode("overwrite") \
    .option("header", "true") \
    .partitionBy("year", "month") \
    .csv("output/partitioned/")
# Output:
# output/partitioned/year=2024/month=1/part-00000.csv
# output/partitioned/year=2025/month=3/part-00000.csv

Reading Back the Output
df_back = spark.read \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .csv("output/employees/")
df_back.show()