Apache Spark 49 guides · updated 2026

Distributed data processing at scale — RDDs, DataFrames, Structured Streaming, and the tuning techniques that keep Spark jobs fast and cheap.

Creating a PySpark DataFrame from Collections

Creating DataFrames from in-memory Python collections is the foundation of PySpark unit testing and quick prototyping. It lets you build and verify your transformation logic before connecting to production data sources.


From a List of Tuples

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DoubleType, BooleanType

spark = SparkSession.builder.appName("CollectionDemo").getOrCreate()

# Column names are supplied as a list; column types are inferred from the values
data = [
    ("Alice", "Engineering", 95000),
    ("Bob", "Marketing", 72000),
    ("Carol", "Engineering", 110000),
]
df = spark.createDataFrame(data, ["name", "department", "salary"])
df.show()
# +-----+-----------+------+
# | name| department|salary|
# +-----+-----------+------+
# |Alice|Engineering| 95000|
# |  Bob|  Marketing| 72000|
# |Carol|Engineering|110000|
# +-----+-----------+------+

With an Explicit Schema

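Prefer an explicit schema in production code. Spark skips the inference pass over the data, and column types and nullability are exactly what you declare rather than whatever inference picks (Python ints, for example, are inferred as long):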

schema = StructType([
    StructField("name", StringType(), nullable=False),
    StructField("department", StringType(), nullable=True),
    StructField("salary", IntegerType(), nullable=True),
])
df = spark.createDataFrame(data, schema)
df.printSchema()
# root
# |-- name: string (nullable = false)
# |-- department: string (nullable = true)
# |-- salary: integer (nullable = true)

From a List of Dictionaries

records = [
    {"product": "Laptop", "price": 1299.99, "in_stock": True, "quantity": 50},
    {"product": "Mouse", "price": 29.99, "in_stock": True, "quantity": 200},
    {"product": "Monitor", "price": 399.99, "in_stock": False, "quantity": 0},
]
df = spark.createDataFrame(records)
df.printSchema()
# root
# |-- in_stock: boolean (nullable = true)
# |-- price: double (nullable = true)
# |-- product: string (nullable = true)
# |-- quantity: long (nullable = true)

From Row Objects

Row objects carry field names with each record, so values can be read by name (row.name or row["name"]) and the DataFrame's column names come from the Row fields:

from pyspark.sql import Row

employees = [
    Row(name="Alice", department="Engineering", salary=95000, active=True),
    Row(name="Bob", department="Marketing", salary=72000, active=False),
]
df = spark.createDataFrame(employees)
df.show()

# A boolean column can be used directly as the filter condition
df.filter(df.active).select("name", "salary").show()

Handling Nulls in Collections

data_with_nulls = [
    ("Alice", "Engineering", 95000),
    ("Bob", None, 72000),            # None → null in Spark
    ("Carol", "Engineering", None),  # None → null for salary
]
schema = StructType([
    StructField("name", StringType(), nullable=False),
    StructField("department", StringType(), nullable=True),
    StructField("salary", IntegerType(), nullable=True),
])
df = spark.createDataFrame(data_with_nulls, schema)
df.show()

# Rows where department is null, then rows where salary is present
df.filter(df.department.isNull()).show()
df.filter(df.salary.isNotNull()).show()

Nested Structures

from pyspark.sql.types import ArrayType, MapType

nested_schema = StructType([
    StructField("name", StringType()),
    StructField("scores", ArrayType(IntegerType())),
    StructField("tags", MapType(StringType(), StringType())),
])
nested_data = [
    ("Alice", [90, 85, 92], {"level": "senior", "team": "backend"}),
    ("Bob", [75, 80], {"level": "junior", "team": "frontend"}),
]
df = spark.createDataFrame(nested_data, nested_schema)
df.show(truncate=False)

# Array elements are addressed by zero-based index
df.select("name", df.scores[0].alias("first_score")).show()

Quick Reference

| Source | Method | Best For |
|---|---|---|
| List of tuples | createDataFrame(data, col_names) | Simple prototyping |
| Explicit schema | createDataFrame(data, schema) | Production, type safety |
| List of dicts | createDataFrame(records) | Irregular structures |
| Row objects | createDataFrame(rows) | Named field access |
| Pandas DataFrame | createDataFrame(pdf) | Data science workflows |