Apache Spark 49 guides · updated 2026

Distributed data processing at scale — RDDs, DataFrames, Structured Streaming, and the tuning techniques that keep Spark jobs fast and cheap.

Apache Spark Datasets

A Dataset is a strongly typed distributed collection that combines the typed, object-oriented API of RDDs with the optimized execution of DataFrames. It provides compile-time type safety, so a misspelled field or a type mismatch is caught by the compiler rather than at runtime, while still benefiting from Spark’s Catalyst optimizer and Tungsten execution engine. The typed Dataset API is available only in Scala and Java; Python users work exclusively with DataFrames.


The Spark API Hierarchy

RDD (low-level, no optimization, any JVM type)
DataFrame (optimized, but Row type — no compile-time type safety)
Dataset[T] (optimized + compile-time type safety via encoders)

In Scala, DataFrame is a type alias for Dataset[Row]; in Java, the same type is written Dataset<Row>.
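The alias can be checked directly in code. The following is a minimal sketch, assuming a local session; the app name, the `local[*]` master, and the sample rows are illustrative choices, not part of the original guide.

```scala
import org.apache.spark.sql.{DataFrame, Dataset, Row, SparkSession}

// Assumption: a throwaway local session, used only for this illustration.
val spark = SparkSession.builder()
  .appName("alias-sketch")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

val df: DataFrame = Seq(("Alice", 95000.0), ("Bob", 72000.0)).toDF("name", "salary")

// Compiles without any conversion because DataFrame and Dataset[Row] are the same type.
val rows: Dataset[Row] = df

// Row fields are looked up by name at runtime and the target type is supplied by hand.
val firstName: String = rows.first().getAs[String]("name")
```

A wrong column name or a wrong type parameter in `getAs` would only fail when the job runs, which is the gap that `Dataset[T]` with a case class closes.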


Creating Datasets (Scala)

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("Dataset Demo").getOrCreate()
import spark.implicits._ // Required for encoders, toDS(), and as[T]

// Define case classes (each one doubles as a schema)
case class Employee(name: String, department: String, salary: Double)
case class Department(name: String, location: String, budget: Double)

// From a sequence
val employees = Seq(
  Employee("Alice", "Engineering", 95000),
  Employee("Bob", "Marketing", 72000),
  Employee("Carol", "Engineering", 110000),
  Employee("Dave", "HR", 65000)
).toDS()

// From a DataFrame (typed conversion); by default spark.read.json expects one JSON object per line
val df = spark.read.json("employees.json")
val typedDs = df.as[Employee]

// From a file with schema inference
val ds = spark.read
  .option("header", "true")
  // Without inferSchema every CSV column is read as a string and .as[Employee] fails on salary
  .option("inferSchema", "true")
  .csv("employees.csv")
  .as[Employee]
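As an alternative to inferring column types from the file, the schema can be taken from the case class itself, which avoids an extra pass over the data. The sketch below assumes a small CSV with a header row; the temporary file, its contents, and the local session settings are made up for illustration.

```scala
import java.nio.file.Files
import org.apache.spark.sql.{Dataset, Encoders, SparkSession}

case class Employee(name: String, department: String, salary: Double)

// Assumption: a throwaway local session, used only for this illustration.
val spark = SparkSession.builder()
  .appName("csv-schema-sketch")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// Hypothetical sample file written to a temp location so the sketch is self-contained.
val csvPath = Files.createTempFile("employees", ".csv")
Files.write(
  csvPath,
  "name,department,salary\nAlice,Engineering,95000\nBob,Marketing,72000\n".getBytes("UTF-8")
)

// Derive the schema from the case class instead of sampling the data.
val employeeSchema = Encoders.product[Employee].schema

val fromCsv: Dataset[Employee] = spark.read
  .option("header", "true")
  .schema(employeeSchema) // salary is parsed as a double, matching the case class
  .csv(csvPath.toString)
  .as[Employee]
```

Keeping the case class as the single source of truth means a new field only has to be added in one place.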

Type-Safe Operations

// map: operates on Employee objects (not Row)
val salaryBumped = employees.map(e =>
  e.copy(salary = e.salary * 1.1)
)

// filter: field access is checked at compile time
val highEarners = employees.filter(_.salary > 80000)

// Case class fields are accessible by name
val names = employees.map(_.name)

// Caught at build time, not at runtime:
// employees.map(_.nonExistentField) // ERROR: value nonExistentField is not a member of Employee

// groupByKey: the typed counterpart of groupBy
val byDept = employees.groupByKey(_.department)
// $"salary" is still resolved at runtime; the result is a Dataset[(String, Double)]
val avgSalary = byDept.agg(avg($"salary").as[Double])
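The aggregation above still refers to the salary column by name. A fully typed variant is sketched here using `mapGroups`, which hands each group to a plain Scala function. The `DeptAverage` case class, the app name, and the local master are assumptions made for this example.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

case class Employee(name: String, department: String, salary: Double)
// Hypothetical result type, introduced only for this sketch.
case class DeptAverage(department: String, avgSalary: Double)

// Assumption: a throwaway local session, used only for this illustration.
val spark = SparkSession.builder()
  .appName("typed-agg-sketch")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

val employees = Seq(
  Employee("Alice", "Engineering", 95000),
  Employee("Bob", "Marketing", 72000),
  Employee("Carol", "Engineering", 110000),
  Employee("Dave", "HR", 65000)
).toDS()

// mapGroups receives the key and an iterator of Employee objects,
// so every field access inside the function is checked by the compiler.
val deptAverages: Dataset[DeptAverage] = employees
  .groupByKey(_.department)
  .mapGroups { (dept, members) =>
    val salaries = members.map(_.salary).toVector
    DeptAverage(dept, salaries.sum / salaries.size)
  }
```

The trade-off is that `mapGroups` does not support partial aggregation, so all rows of a group are shuffled to one task; for simple averages on large data the column-based `agg` form is usually the cheaper choice.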

Encoders

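Encoders are the bridge between JVM objects and Spark’s internal binary format (Tungsten). An encoder is generated from the type’s schema, so Spark can filter, sort, and hash the encoded data without deserializing it back into objects. For case classes this is typically faster and more compact than Java or Kryo serialization.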

import org.apache.spark.sql.Encoders
// Spark auto-derives encoders for case classes
val employeeEncoder = Encoders.product[Employee]
// For primitive types
val stringEncoder = Encoders.STRING
val intEncoder = Encoders.scalaInt
// Explicit encoder (usually implicit via import spark.implicits._)
val ds = spark.createDataset(Seq(Employee("Alice", "Eng", 95000)))(employeeEncoder)
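To see what an encoder exposes to Spark, it can help to compare the schema of a product encoder with that of a Kryo-based one. This is a small sketch; `LegacyRecord` is a hypothetical class that stands in for any type without a built-in encoder.

```scala
import org.apache.spark.sql.{Encoder, Encoders}

case class Employee(name: String, department: String, salary: Double)

// Hypothetical non-case class, used only to illustrate the Kryo fallback.
class LegacyRecord(val id: Int) extends Serializable

val productEncoder: Encoder[Employee] = Encoders.product[Employee]
val kryoEncoder: Encoder[LegacyRecord] = Encoders.kryo[LegacyRecord]

// The product encoder exposes one column per case class field.
println(productEncoder.schema.fieldNames.mkString(", "))

// The Kryo encoder stores the whole object in a single binary column,
// so Catalyst cannot see or optimize on individual fields.
println(kryoEncoder.schema.fieldNames.mkString(", "))
```

Because the Kryo-encoded object is opaque to the optimizer, column pruning and predicate pushdown on its fields are not available, which is one reason to prefer case classes where possible.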

Converting Between Dataset, DataFrame, and RDD

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, Dataset}

// Dataset → DataFrame (lose type safety)
val df: DataFrame = employees.toDF()
// DataFrame → Dataset (gain type safety; may fail at runtime if the schema does not match)
val ds: Dataset[Employee] = df.as[Employee]
// Dataset → RDD (lose Catalyst optimization)
val rdd: RDD[Employee] = employees.rdd
// RDD → Dataset (gain optimization; needs import spark.implicits._)
val back: Dataset[Employee] = rdd.toDS()
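The runtime failure mentioned for the DataFrame to Dataset conversion can be reproduced with a DataFrame that lacks some of the expected columns. The sketch below assumes a local session and an invented two-column DataFrame. It wraps both the conversion and an action in `Try`, since the point at which the error surfaces can differ between Spark versions.

```scala
import scala.util.Try
import org.apache.spark.sql.SparkSession

case class Employee(name: String, department: String, salary: Double)

// Assumption: a throwaway local session, used only for this illustration.
val spark = SparkSession.builder()
  .appName("schema-mismatch-sketch")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// This DataFrame has no department or salary column, so it cannot back an Employee.
val incomplete = Seq(("Alice", 30)).toDF("name", "age")

// The compiler accepts this line; the mismatch is only detected when Spark
// resolves the encoder against the actual columns.
val attempt = Try(incomplete.as[Employee].collect())

println(s"Conversion failed: ${attempt.isFailure}")
```

Validating the schema once, right after the read, keeps this kind of error at the edge of the pipeline instead of deep inside a transformation.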

Dataset vs DataFrame vs RDD

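Feature | RDD | DataFrame | Dataset
Compile-time type safety | Yes | No (Row) | Yes
Catalyst optimization | No | Yes | Yes
Tungsten memory management | No | Yes | Yes
Python support | Yes | Yes | No
Lambda functions | Yes | Limited | Yes
Schema enforcement | Manual | Yes | Yes
Serialization | Java/Kryo | Internal format | Encoders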

When to Use Datasets (2025)

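Datasets are ideal when the code is written in Scala or Java and compile-time type safety matters more than the convenience of untyped column expressions.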

In practice, PySpark teams use DataFrames exclusively, because the typed Dataset API is not available in Python. Scala teams often use Datasets for business-critical pipelines and DataFrames for ad-hoc analysis.