Spark: union() vs unionAll() vs unionByName()
Spark provides three ways to stack DataFrames vertically. They differ in how columns are matched (by position or name) and whether duplicates are removed.
Quick Reference
| Method | Column matching | Deduplication | Spark version |
|---|---|---|---|
union() | By position | None (keeps all) | All versions |
unionAll() | By position | None (keeps all) | Deprecated in 2.0+ |
unionByName() | By name | None (keeps all) | 2.3+ |
SQL UNION | By name (view) | Yes — removes duplicates | All versions |
SQL UNION ALL | By name (view) | None (keeps all) | All versions |
union() — Position-Based
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("Unions").getOrCreate()
df1 = spark.createDataFrame([(1, "Alice", 95000)], ["id", "name", "salary"])df2 = spark.createDataFrame([(2, "Bob", 72000)], ["id", "name", "salary"])df3 = spark.createDataFrame([(3, "Carol", 95000)], ["id", "name", "salary"])
# Stack all three — column 1 to column 1, column 2 to column 2, etc.stacked = df1.union(df2).union(df3)stacked.show()# +---+-----+------+# | id| name|salary|# +---+-----+------+# | 1|Alice| 95000|# | 2| Bob| 72000|# | 3|Carol| 95000|# +---+-----+------+
# DOES NOT deduplicate — even identical rows are keptdf1.union(df1).count() # 2 (not 1)unionAll() — Deprecated Alias
# In Spark 2.0+, unionAll() is an alias for union() and deprecated# These are equivalent:df1.union(df2)df1.unionAll(df2) # Deprecated — avoid in new codeunionByName() — Name-Based Matching
# Columns matched by name — order doesn't matterdf_a = spark.createDataFrame([(1, "Alice", 95000)], ["id", "name", "salary"])df_b = spark.createDataFrame([(2, 72000, "Bob")], ["id", "salary", "name"]) # Different order
# union() fails silently — wrong column alignmentdf_a.union(df_b).show() # salary and name mixed up!
# unionByName() aligns by namedf_a.unionByName(df_b).show() # Correct result
# Spark 3.1+: fill missing columns with nulldf_a.unionByName(df_b, allowMissingColumns=True)SQL UNION vs UNION ALL
df1.createOrReplaceTempView("t1")df2.createOrReplaceTempView("t2")
# UNION — removes duplicates (like distinct on the combined set)spark.sql("SELECT * FROM t1 UNION SELECT * FROM t2").show()
# UNION ALL — keeps all rows including duplicates (fastest)spark.sql("SELECT * FROM t1 UNION ALL SELECT * FROM t2").show()Choosing the Right Method
# Same column order, no dedup needed → union()monthly_reports = jan_df.union(feb_df).union(mar_df)
# Different column order, no dedup needed → unionByName()from functools import reducecombined = reduce(lambda a, b: a.unionByName(b), [jan_df, feb_df, mar_df])
# Need deduplication → use distinct() after unioncombined.distinct()
# Or use SQL UNIONcombined_sql = spark.sql("SELECT * FROM jan UNION SELECT * FROM feb")Performance
union()/unionByName()are lazy narrow transformations — no shuffle- Adding
.distinct()after union causes a shuffle (wide transformation) - For many DataFrames:
reduce(lambda a, b: a.union(b), dfs)works but can build a deep lineage; considerspark.read.parquet([path1, path2, ...])instead when reading from files