Technology  /  Apache Spark

Apache Spark 49 guides · updated 2026

Distributed data processing at scale — RDDs, DataFrames, Structured Streaming, and the tuning techniques that keep Spark jobs fast and cheap.

Spark Word Count

Word count is the canonical “hello world” program for distributed data processing. It demonstrates Spark’s core concepts: reading data, applying transformations, aggregating with keys, and producing output.


RDD API Word Count

from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("WordCount").getOrCreate()
sc = spark.sparkContext
# Read input
lines = sc.textFile("s3://bucket/books/*.txt")
# Transform
word_counts = lines \
.flatMap(lambda line: line.lower().split()) \
.map(lambda word: (word, 1)) \
.reduceByKey(lambda a, b: a + b) \
.sortBy(lambda x: x[1], ascending=False)
# Output
top_words = word_counts.take(20)
for word, count in top_words:
print(f"{word:20s} {count}")
word_counts.saveAsTextFile("s3://bucket/output/wordcount/")

DataFrame API Word Count

from pyspark.sql import functions as F
lines = spark.read.text("s3://bucket/books/*.txt") # Column named "value"
word_counts = lines \
.select(F.explode(F.split(F.lower(F.col("value")), r"\W+")).alias("word")) \
.filter(F.col("word") != "") \
.groupBy("word") \
.count() \
.orderBy(F.col("count").desc())
word_counts.show(20)

SQL API Word Count

lines = spark.read.text("books.txt")
lines.createOrReplaceTempView("lines")
spark.sql("""
SELECT word, COUNT(*) AS count
FROM (
SELECT EXPLODE(SPLIT(LOWER(value), '\\\W+')) AS word
FROM lines
)
WHERE word != ''
GROUP BY word
ORDER BY count DESC
LIMIT 20
""").show()

Cleaning Up Words

Real text needs cleaning — punctuation, stopwords, and empty strings pollute results:

import re
stopwords = {"the", "a", "an", "is", "are", "was", "were", "in", "on", "at", "of", "to"}
bc_stopwords = sc.broadcast(stopwords)
cleaned_counts = sc.textFile("books.txt") \
.flatMap(lambda line: re.split(r"\W+", line.lower())) \
.filter(lambda w: w and len(w) > 2 and w not in bc_stopwords.value) \
.map(lambda w: (w, 1)) \
.reduceByKey(lambda a, b: a + b) \
.filter(lambda x: x[1] >= 5) \
.sortBy(lambda x: x[1], ascending=False)
cleaned_counts.take(20)

Word Count Concepts Illustrated

This simple program demonstrates:

ConceptWhere in Word Count
TransformationflatMap, map, filter
Actiontake(), saveAsTextFile()
Lazy evaluationNone of the transforms run until take()
ShufflereduceByKey — groups by word across partitions
BroadcastStopwords list sent once to all executors
Lineagecounts ← reduceByKey ← map ← flatMap ← textFile