Apache Spark Introduction

Apache Spark is an open-source, unified analytics engine for large-scale data processing. Because it keeps working data in memory, Spark can run iterative algorithms and interactive analytics up to 100× faster than Hadoop MapReduce; the gap is much smaller for single-pass, disk-bound jobs. Originally developed at UC Berkeley’s AMPLab in 2009, it became an Apache top-level project in 2014 and is now widely regarded as the de facto standard for distributed data processing.

Why Spark Outperforms MapReduce

MapReduce writes map output to local disk and each job's final output to HDFS, so a multi-step pipeline pays for disk I/O between every stage. Spark keeps intermediate data in memory where it can and touches disk only for shuffles, spills, or explicit persistence:

Feature	Hadoop MapReduce	Apache Spark
Processing	Disk-based (every step)	In-memory
Speed	Baseline	Up to 100× faster on iterative, in-memory workloads
Programming model	Map + Reduce only	RDD, DataFrame, SQL, Streaming, ML
Language support	Java (other languages via Hadoop Streaming)	Python, Scala, Java, R, SQL
Fault tolerance	Data replication	Lineage-based recomputation
Interactive queries	❌	✅
Streaming	Batch only	Micro-batch (continuous mode is experimental)
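The "lineage-based recomputation" row can be illustrated with a toy model in plain Python. This is a sketch of the idea only, not Spark's implementation: `ToyDataset`, `persist`, `evict`, and `compute_count` are hypothetical names invented for this example. Each dataset remembers its parent and the function that derives it, so a lost cached result is rebuilt by replaying the recorded steps rather than restored from a replica.

```python
from typing import Callable, List, Optional


class ToyDataset:
    """Toy stand-in for an RDD: each dataset remembers its parent and the
    function that derives it, so lost results can be rebuilt on demand."""

    def __init__(
        self,
        source: Optional[List[int]] = None,
        parent: Optional["ToyDataset"] = None,
        transform: Optional[Callable[[List[int]], List[int]]] = None,
    ) -> None:
        self._source = source
        self._parent = parent
        self._transform = transform
        self._persist = False
        self._cached: Optional[List[int]] = None
        self.compute_count = 0  # number of times this step was actually computed

    def map(self, fn: Callable[[int], int]) -> "ToyDataset":
        # Record the step; nothing is computed yet.
        return ToyDataset(parent=self, transform=lambda rows: [fn(r) for r in rows])

    def filter(self, pred: Callable[[int], bool]) -> "ToyDataset":
        return ToyDataset(parent=self, transform=lambda rows: [r for r in rows if pred(r)])

    def persist(self) -> "ToyDataset":
        # Ask for the result to be kept in memory once it has been computed.
        self._persist = True
        return self

    def evict(self) -> None:
        # Simulate losing the in-memory copy (for example, a failed executor).
        self._cached = None

    def collect(self) -> List[int]:
        if self._cached is not None:
            return self._cached  # served from memory, no recomputation
        self.compute_count += 1
        if self._parent is None:
            rows = list(self._source or [])
        else:
            # Walk the lineage: recompute the parent if needed, then this step.
            rows = self._transform(self._parent.collect())
        if self._persist:
            self._cached = rows
        return rows


numbers = ToyDataset(source=[1, 2, 3, 4, 5, 6])
squares = numbers.map(lambda x: x * x).persist()
evens = squares.filter(lambda x: x % 2 == 0)

first = evens.collect()   # computes squares and keeps them in memory
squares.evict()           # the cached copy is lost
second = evens.collect()  # squares are rebuilt from lineage; same answer
```

In this toy run `squares.compute_count` ends at 2: once for the first action and once after the simulated loss. A further `evens.collect()` would reuse the cached squares without recomputing them.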

Core Components

Apache Spark Ecosystem
├── Spark Core           - RDDs, fault tolerance, scheduling
├── Spark SQL            - DataFrames, Datasets, SQL queries
├── Spark Streaming      - Legacy micro-batch streaming (DStream API)
├── Structured Streaming - Streaming on the Spark SQL engine; micro-batch by default, with an experimental continuous mode
├── MLlib                - Distributed machine learning
└── GraphX               - Graph processing
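The micro-batch model used by the streaming components above can be sketched in plain Python. This is a toy illustration under simplifying assumptions, not Spark code: `micro_batches` and `running_counts` are hypothetical helpers, events arrive as an in-memory list of `(timestamp, key)` pairs, and intervals with no events are skipped. The point is that a stream is processed as a sequence of small batch jobs, with state carried from one batch to the next.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Event = Tuple[float, str]  # (timestamp in seconds, key)


def micro_batches(events: Iterable[Event], interval: float) -> List[List[Event]]:
    """Group events into consecutive fixed-length time intervals."""
    batches: Dict[int, List[Event]] = defaultdict(list)
    for ts, key in events:
        batches[int(ts // interval)].append((ts, key))
    # Return batches in time order; empty intervals are omitted in this toy.
    return [batches[i] for i in sorted(batches)]


def running_counts(batches: List[List[Event]]) -> List[Dict[str, int]]:
    """Process one batch at a time, carrying per-key counts forward."""
    state: Dict[str, int] = defaultdict(int)
    snapshots: List[Dict[str, int]] = []
    for batch in batches:
        for _, key in batch:
            state[key] += 1
        snapshots.append(dict(state))  # result table after this batch
    return snapshots


events = [(0.2, "click"), (0.9, "view"), (1.1, "click"), (2.5, "click")]
batches = micro_batches(events, interval=1.0)
snapshots = running_counts(batches)
```

With a one-second interval the four events fall into three batches, and `snapshots` holds the running count per key after each batch.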

A First PySpark Program

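from pyspark.sql import SparkSession, functions as F

# Create a session, or reuse one that is already running
spark = SparkSession.builder \
    .appName("SparkIntro") \
    .getOrCreate()

# Read a CSV file; inferSchema parses revenue as a number instead of a string
df = spark.read \
    .option("header", True) \
    .option("inferSchema", True) \
    .csv("sales.csv")

# Transform: total revenue per product category for the APAC region
result = df \
    .filter(F.col("region") == "APAC") \
    .groupBy("product_category") \
    .agg(F.sum("revenue").alias("total_revenue")) \
    .orderBy(F.col("total_revenue").desc())

# show() is an action: it triggers execution of the lazy transformations above
result.show()
spark.stop()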
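To check what the pipeline above should return without starting a cluster, the same filter, group, sum, and sort logic can be written in plain Python. The rows below are hypothetical stand-ins for `sales.csv`, whose real contents are not shown here, and `total_revenue_by_category` is an illustrative helper rather than part of any Spark API.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical sample rows; the real sales.csv may have different values.
rows = [
    {"region": "APAC", "product_category": "Laptops", "revenue": 1200.0},
    {"region": "APAC", "product_category": "Phones", "revenue": 800.0},
    {"region": "EMEA", "product_category": "Laptops", "revenue": 950.0},
    {"region": "APAC", "product_category": "Laptops", "revenue": 300.0},
]


def total_revenue_by_category(rows: List[dict], region: str) -> List[Tuple[str, float]]:
    totals: Dict[str, float] = defaultdict(float)
    for row in rows:
        if row["region"] == region:  # filter(region == ...)
            totals[row["product_category"]] += row["revenue"]  # groupBy + sum
    # orderBy(total_revenue desc)
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)


result = total_revenue_by_category(rows, "APAC")
```

On these sample rows `result` lists Laptops first with 1500.0, then Phones with 800.0. Spark computes the same answer, but distributes the filter and partial sums across partitions before combining them.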

Spark in 2025: Key Trends

Delta Lake / Iceberg / Hudi: open table formats for lakehouse architectures; they add ACID transactions, schema evolution, and time travel on top of data files (typically Parquet), replacing unmanaged Parquet directories
Apache Spark 3.5: improved Adaptive Query Execution (AQE) and better Python UDF performance through opt-in Arrow optimization
Spark Connect: thin client API introduced in Spark 3.4; Spark logic executes server-side, decoupling the client from the cluster version
PySpark + Arrow: columnar data exchange between Python and the JVM; pandas UDFs are vectorized through Arrow, and regular Python UDFs can opt in to Arrow-based serialization
Spark on Kubernetes: increasingly chosen over YARN as the cluster manager in cloud-native environments

When to Use Spark

Use Case	Spark?
Processing terabytes to petabytes	✅ Ideal
ETL pipelines and data transformation	✅ Ideal
Real-time streaming analytics	✅ Structured Streaming
Distributed ML training on large datasets	✅ MLlib or Spark + PyTorch
Simple analytics on < 1 GB	❌ Single-node tools such as pandas are usually faster
OLTP / transactional workloads	❌ Use a relational database
Low-latency sub-second queries	❌ Use Presto/Trino