Creating a PySpark DataFrame from Collections

Building DataFrames from in-memory Python collections is the usual starting point for PySpark unit tests and quick prototypes. With small, known inputs you can verify your transformation logic before connecting to production data sources.

From a List of Tuples

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DoubleType, BooleanType

spark = SparkSession.builder.appName("CollectionDemo").getOrCreate()

# Column names are passed explicitly; column types are inferred from the values
data = [
    ("Alice", "Engineering", 95000),
    ("Bob",   "Marketing",   72000),
    ("Carol", "Engineering", 110000),
]
df = spark.createDataFrame(data, ["name", "department", "salary"])
df.show()
# +-----+-----------+------+
# | name| department|salary|
# +-----+-----------+------+
# |Alice|Engineering| 95000|
# |  Bob|  Marketing| 72000|
# |Carol|Engineering|110000|
# +-----+-----------+------+

With an Explicit Schema

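Provide an explicit schema in production code. Spark then skips the inference pass, and the column types and nullability are exactly what you declare rather than whatever the sample values imply (for example, Python ints are inferred as long, not integer):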

schema = StructType([
    StructField("name",       StringType(),  nullable=False),
    StructField("department", StringType(),  nullable=True),
    StructField("salary",     IntegerType(), nullable=True),
])

df = spark.createDataFrame(data, schema)
df.printSchema()
# root
#  |-- name: string (nullable = false)
#  |-- department: string (nullable = true)
#  |-- salary: integer (nullable = true)

From a List of Dictionaries

records = [
    {"product": "Laptop",  "price": 1299.99, "in_stock": True,  "quantity": 50},
    {"product": "Mouse",   "price": 29.99,   "in_stock": True,  "quantity": 200},
    {"product": "Monitor", "price": 399.99,  "in_stock": False, "quantity": 0},
]

df = spark.createDataFrame(records)
df.printSchema()
# root
#  |-- in_stock: boolean (nullable = true)
#  |-- price: double (nullable = true)
#  |-- product: string (nullable = true)
#  |-- quantity: long (nullable = true)

From Row Objects

Row gives each record named fields. The field names become the column names, which keeps records that mix several value types readable:

from pyspark.sql import Row

employees = [
    Row(name="Alice", department="Engineering", salary=95000, active=True),
    Row(name="Bob",   department="Marketing",   salary=72000, active=False),
]

df = spark.createDataFrame(employees)
df.show()
df.filter(df.active).select("name", "salary").show()  # a boolean column can be used directly as the filter condition

Handling Nulls in Collections

data_with_nulls = [
    ("Alice", "Engineering", 95000),
    ("Bob",   None,           72000),   # None → null in Spark
    ("Carol", "Engineering", None),     # None → null for salary
]

schema = StructType([
    StructField("name",       StringType(),  nullable=False),
    StructField("department", StringType(),  nullable=True),
    StructField("salary",     IntegerType(), nullable=True),
])

df = spark.createDataFrame(data_with_nulls, schema)
df.show()
df.filter(df.department.isNull()).show()
df.filter(df.salary.isNotNull()).show()

Nested Structures

from pyspark.sql.types import ArrayType, MapType

nested_schema = StructType([
    StructField("name",   StringType()),
    StructField("scores", ArrayType(IntegerType())),
    StructField("tags",   MapType(StringType(), StringType())),
])

nested_data = [
    ("Alice", [90, 85, 92], {"level": "senior", "team": "backend"}),
    ("Bob",   [75, 80],     {"level": "junior", "team": "frontend"}),
]

df = spark.createDataFrame(nested_data, nested_schema)
df.show(truncate=False)
df.select("name", df.scores[0].alias("first_score")).show()

Quick Reference

Source	Method	Best For
List of tuples	`createDataFrame(data, col_names)`	Simple prototyping
Explicit schema	`createDataFrame(data, schema)`	Production, type safety
List of dicts	`createDataFrame(records)`	Records with optional or missing keys
Row objects	`createDataFrame(rows)`	Named field access
Pandas DataFrame	`createDataFrame(pdf)`	Data science workflows