Apache Spark Introduction
Apache Spark is an open-source, unified analytics engine for large-scale data processing. Built around in-memory computation, Spark can run iterative algorithms and interactive analytics 10-100× faster than Hadoop MapReduce, with the largest gains when the working set fits in memory. It began as a research project at UC Berkeley’s AMPLab in 2009, became an Apache top-level project in 2014, and is now the de facto standard for distributed data processing.
Why Spark Outperforms MapReduce
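A multi-stage MapReduce pipeline materializes its data between steps: map output is spilled to local disk, and each job writes its result to HDFS before the next job reads it back. That means expensive disk I/O at every stage. Spark keeps intermediate data in memory where it can and writes to disk only when necessary: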
| Feature | Hadoop MapReduce | Apache Spark |
|---|---|---|
| Processing | Disk-based (every step) | In-memory |
| Speed | Baseline | Up to 10-100× faster on iterative, in-memory workloads |
| Programming model | Map + Reduce only | RDD, DataFrame, SQL, Streaming, ML |
| Language support | Java (other languages via Hadoop Streaming) | Python, Scala, Java, R, SQL |
| Fault tolerance | Data replication | Lineage-based recomputation |
| Interactive queries | ❌ | ✅ |
| Streaming | Batch only | Micro-batch + continuous |
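
The fault-tolerance row of the table is the least intuitive one, so here is a toy model in plain Python. It is only a sketch of the idea, not Spark's implementation: the `ToyDataset` class and its methods are hypothetical names invented for illustration. Each dataset remembers its parent and the function that produced it, so a lost in-memory result can be rebuilt by replaying that chain instead of being restored from a replica.

```python
from typing import Callable, List, Optional


class ToyDataset:
    """Hypothetical stand-in for an RDD: it stores lineage, not just data."""

    def __init__(self,
                 source: Optional[List[int]] = None,
                 parent: Optional["ToyDataset"] = None,
                 transform: Optional[Callable[[List[int]], List[int]]] = None):
        self.source = source        # only set on the root dataset
        self.parent = parent        # lineage: where this dataset came from
        self.transform = transform  # lineage: how it was derived
        self._cached: Optional[List[int]] = None
        self.compute_count = 0      # how often this node was (re)computed

    def map(self, fn: Callable[[int], int]) -> "ToyDataset":
        return ToyDataset(parent=self,
                          transform=lambda rows: [fn(r) for r in rows])

    def filter(self, pred: Callable[[int], bool]) -> "ToyDataset":
        return ToyDataset(parent=self,
                          transform=lambda rows: [r for r in rows if pred(r)])

    def collect(self) -> List[int]:
        # Simplification: this toy caches every node after computing it.
        if self._cached is not None:
            return self._cached
        self.compute_count += 1
        if self.parent is None:
            rows = list(self.source)
        else:
            rows = self.transform(self.parent.collect())
        self._cached = rows
        return rows

    def lose_cache(self) -> None:
        """Simulate losing the in-memory copy, e.g. after an executor failure."""
        self._cached = None


numbers = ToyDataset(source=[1, 2, 3, 4, 5, 6])
evens_squared = numbers.filter(lambda x: x % 2 == 0).map(lambda x: x * x)

first = evens_squared.collect()      # computed by walking the lineage chain
evens_squared.lose_cache()           # the in-memory result disappears
recovered = evens_squared.collect()  # rebuilt by replaying the map step
print(first, recovered, evens_squared.compute_count)
```

Unlike this toy, Spark tracks lineage per partition and keeps results in memory only when asked to (for example with `cache()`), so after a failure it recomputes just the lost partitions.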
Core Components
```
Apache Spark Ecosystem
├── Spark Core: RDDs, fault tolerance, scheduling
├── Spark SQL: DataFrames, Datasets, SQL queries
├── Spark Streaming: micro-batch streaming (DStream, legacy)
├── Structured Streaming: DataFrame-based streaming (modern; micro-batch by default)
├── MLlib: distributed machine learning
└── GraphX: graph processing
```
A First PySpark Program
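```python
from pyspark.sql import SparkSession, functions as F

# Create (or reuse) the session that serves as the entry point to Spark
spark = SparkSession.builder \
    .appName("SparkIntro") \
    .getOrCreate()

# Read a CSV file, using the first row as column names
df = spark.read.option("header", True).csv("sales.csv")

# Transform: keep APAC rows, total revenue per category, largest first.
# With only the header option set, every column is read as a string,
# so revenue is cast to a number before summing.
result = df \
    .filter(F.col("region") == "APAC") \
    .groupBy("product_category") \
    .agg(F.sum(F.col("revenue").cast("double")).alias("total_revenue")) \
    .orderBy(F.col("total_revenue").desc())

result.show()  # triggers execution and prints the first rows
spark.stop()
```
Spark in 2025: Key Trends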
- Delta Lake / Iceberg / Hudi: open table formats for lakehouse architectures, replacing directories of raw Parquet files
- Apache Spark 3.5: improved Adaptive Query Execution (AQE), better Python UDF performance
- Spark Connect: thin client API; Spark logic executes server-side, decoupling the client from the cluster version
- PySpark + Arrow: columnar data exchange between Python and the JVM; pandas UDFs process whole batches instead of single rows, and Spark 3.5 adds opt-in Arrow optimization for regular Python UDFs
- Spark on Kubernetes: replacing YARN as the preferred cluster manager in cloud-native environments
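
Several of these trends surface as ordinary Spark configuration keys. The sketch below assembles a `spark-submit` command line in plain Python, without importing Spark. The configuration keys are real Spark settings, but the API server address, the container image name, and the `job.py` path are placeholders assumed for illustration, and `build_submit_command` is a hypothetical helper name.

```python
from typing import Dict, List

# Real Spark configuration keys; the values are illustrative choices.
TREND_CONFS: Dict[str, str] = {
    # Adaptive Query Execution (on by default since Spark 3.2)
    "spark.sql.adaptive.enabled": "true",
    # Arrow-based columnar transfer for toPandas() and createDataFrame(pandas_df)
    "spark.sql.execution.arrow.pyspark.enabled": "true",
    # Opt-in Arrow optimization for regular Python UDFs (Spark 3.5+)
    "spark.sql.execution.pythonUDF.arrow.enabled": "true",
    # Placeholder image name; an image is required when running on Kubernetes
    "spark.kubernetes.container.image": "example.com/spark-py:3.5.0",
}


def build_submit_command(master: str, script: str,
                         confs: Dict[str, str]) -> List[str]:
    """Return a spark-submit argument list for the given master and settings."""
    command = ["spark-submit", "--master", master, "--deploy-mode", "cluster"]
    for key, value in confs.items():
        command += ["--conf", f"{key}={value}"]
    command.append(script)
    return command


# Placeholder Kubernetes API server address; replace it with a real one.
command = build_submit_command(
    master="k8s://https://kubernetes.example.com:6443",
    script="local:///opt/spark/app/job.py",
    confs=TREND_CONFS,
)
print(" ".join(command))
```

The same keys can also be set in `spark-defaults.conf` or through `SparkSession.builder.config(...)`; the command-line form is used here only because it keeps the example free of dependencies.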
When to Use Spark
| Use Case | Spark? |
|---|---|
| Processing terabytes to petabytes | ✅ Ideal |
| ETL pipelines and data transformation | ✅ Ideal |
| Real-time streaming analytics | ✅ Structured Streaming |
| Distributed ML training on large datasets | ✅ MLlib or Spark + PyTorch |
| Simple analytics on < 1 GB | ❌ Pandas is faster |
| OLTP / transactional workloads | ❌ Use a relational database |
| Low-latency sub-second queries | ❌ Use Presto/Trino |
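
The small-data row is easiest to appreciate by seeing how little is needed once everything fits in one process. As a rough sketch, the filter, group, and sum from the PySpark program can be written with only the Python standard library. pandas would be the usual choice; the standard library is used here so the example has no dependencies. The inline CSV rows are made-up sample data, and `total_revenue_by_category` is a hypothetical helper name.

```python
import csv
import io
from collections import defaultdict
from typing import Dict, List, Tuple

# Made-up sample rows standing in for a small sales.csv
SAMPLE_CSV = """region,product_category,revenue
APAC,Laptops,1200.50
EMEA,Laptops,900.00
APAC,Phones,800.25
APAC,Laptops,300.00
"""


def total_revenue_by_category(csv_text: str,
                              region: str) -> List[Tuple[str, float]]:
    """Sum revenue per product category for one region, largest total first."""
    totals: Dict[str, float] = defaultdict(float)
    for row in csv.DictReader(io.StringIO(csv_text)):
        if row["region"] == region:
            totals[row["product_category"]] += float(row["revenue"])
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


result = total_revenue_by_category(SAMPLE_CSV, "APAC")
for category, total in result:
    print(f"{category}: {total:.2f}")
```

There is no session to start and no job to schedule, which is why a single-process tool tends to win at this scale.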