Technology  /  Apache Spark

Apache Spark 49 guides · updated 2026

Distributed data processing at scale — RDDs, DataFrames, Structured Streaming, and the tuning techniques that keep Spark jobs fast and cheap.

Spark: union() vs unionAll() vs unionByName()

Spark provides three ways to stack DataFrames vertically. They differ in how columns are matched (by position or name) and whether duplicates are removed.


Quick Reference

MethodColumn matchingDeduplicationSpark version
union()By positionNone (keeps all)All versions
unionAll()By positionNone (keeps all)Deprecated in 2.0+
unionByName()By nameNone (keeps all)2.3+
SQL UNIONBy name (view)Yes — removes duplicatesAll versions
SQL UNION ALLBy name (view)None (keeps all)All versions

union() — Position-Based

from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("Unions").getOrCreate()
df1 = spark.createDataFrame([(1, "Alice", 95000)], ["id", "name", "salary"])
df2 = spark.createDataFrame([(2, "Bob", 72000)], ["id", "name", "salary"])
df3 = spark.createDataFrame([(3, "Carol", 95000)], ["id", "name", "salary"])
# Stack all three — column 1 to column 1, column 2 to column 2, etc.
stacked = df1.union(df2).union(df3)
stacked.show()
# +---+-----+------+
# | id| name|salary|
# +---+-----+------+
# | 1|Alice| 95000|
# | 2| Bob| 72000|
# | 3|Carol| 95000|
# +---+-----+------+
# DOES NOT deduplicate — even identical rows are kept
df1.union(df1).count() # 2 (not 1)

unionAll() — Deprecated Alias

# In Spark 2.0+, unionAll() is an alias for union() and deprecated
# These are equivalent:
df1.union(df2)
df1.unionAll(df2) # Deprecated — avoid in new code

unionByName() — Name-Based Matching

# Columns matched by name — order doesn't matter
df_a = spark.createDataFrame([(1, "Alice", 95000)], ["id", "name", "salary"])
df_b = spark.createDataFrame([(2, 72000, "Bob")], ["id", "salary", "name"]) # Different order
# union() fails silently — wrong column alignment
df_a.union(df_b).show() # salary and name mixed up!
# unionByName() aligns by name
df_a.unionByName(df_b).show() # Correct result
# Spark 3.1+: fill missing columns with null
df_a.unionByName(df_b, allowMissingColumns=True)

SQL UNION vs UNION ALL

df1.createOrReplaceTempView("t1")
df2.createOrReplaceTempView("t2")
# UNION — removes duplicates (like distinct on the combined set)
spark.sql("SELECT * FROM t1 UNION SELECT * FROM t2").show()
# UNION ALL — keeps all rows including duplicates (fastest)
spark.sql("SELECT * FROM t1 UNION ALL SELECT * FROM t2").show()

Choosing the Right Method

# Same column order, no dedup needed → union()
monthly_reports = jan_df.union(feb_df).union(mar_df)
# Different column order, no dedup needed → unionByName()
from functools import reduce
combined = reduce(lambda a, b: a.unionByName(b), [jan_df, feb_df, mar_df])
# Need deduplication → use distinct() after union
combined.distinct()
# Or use SQL UNION
combined_sql = spark.sql("SELECT * FROM jan UNION SELECT * FROM feb")

Performance