Spark Word Count
Word count is the canonical “hello world” program for distributed data processing. It demonstrates Spark’s core concepts: reading data, applying transformations, aggregating with keys, and producing output.
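Before involving a cluster, the four stages can be mirrored in plain Python. The sketch below is only an illustration of the data flow and does not use Spark; the function name `word_count` is a hypothetical helper, and the comments name the Spark operation each step loosely corresponds to.

```python
from itertools import chain


def word_count(lines):
    """Count words in an iterable of text lines, most frequent first."""
    # flatMap: each line expands into zero or more lowercase words
    words = chain.from_iterable(line.lower().split() for line in lines)
    # map: pair every word with an initial count of 1
    pairs = ((word, 1) for word in words)
    # reduceByKey: add up the values that share the same key
    counts = {}
    for word, n in pairs:
        counts[word] = counts.get(word, 0) + n
    # sortBy: highest count first (ties keep first-seen order)
    return sorted(counts.items(), key=lambda kv: kv[1], reverse=True)


top = word_count(["the cat", "The dog"])
# top == [("the", 2), ("cat", 1), ("dog", 1)]
```

The generator expressions are lazy, so no line is split until the loop consumes the pairs. This is only a loose, single-machine analogy for how Spark defers work until an action runs.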
RDD API Word Count
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("WordCount").getOrCreate()
sc = spark.sparkContext

# Read input: one RDD element per line of text
lines = sc.textFile("s3://bucket/books/*.txt")

# Transform: split into words, pair each with 1, sum per word, sort by count
word_counts = (
    lines
    .flatMap(lambda line: line.lower().split())
    .map(lambda word: (word, 1))
    .reduceByKey(lambda a, b: a + b)
    .sortBy(lambda x: x[1], ascending=False)
)

# Output: print the 20 most frequent words, then save the full result
top_words = word_counts.take(20)
for word, count in top_words:
    print(f"{word:20s} {count}")

word_counts.saveAsTextFile("s3://bucket/output/wordcount/")

DataFrame API Word Count
from pyspark.sql import functions as F
lines = spark.read.text("s3://bucket/books/*.txt") # Column named "value"
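# Split each line on runs of non-word characters and emit one row per word
word_counts = (
    lines
    .select(F.explode(F.split(F.lower(F.col("value")), r"\W+")).alias("word"))
    .filter(F.col("word") != "")  # splitting can leave empty strings at line edges
    .groupBy("word")
    .count()
    .orderBy(F.col("count").desc())
)

word_counts.show(20)

SQL API Word Count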
lines = spark.read.text("books.txt")
lines.createOrReplaceTempView("lines")

# Raw string so Python passes '\\W+' through unchanged; with default parser
# settings, Spark SQL then unescapes it to the regex \W+
spark.sql(r"""
    SELECT word, COUNT(*) AS count
    FROM (
        SELECT EXPLODE(SPLIT(LOWER(value), '\\W+')) AS word
        FROM lines
    )
    WHERE word != ''
    GROUP BY word
    ORDER BY count DESC
    LIMIT 20
""").show()

Cleaning Up Words
Real text needs cleaning, because punctuation, stopwords, and empty strings pollute the results:
import re
stopwords = {"the", "a", "an", "is", "are", "was", "were", "in", "on", "at", "of", "to"}
bc_stopwords = sc.broadcast(stopwords)  # shipped to executors once, not with every task

cleaned_counts = (
    sc.textFile("books.txt")
    # Split on runs of non-word characters, which also strips punctuation
    .flatMap(lambda line: re.split(r"\W+", line.lower()))
    # Drop empty strings, words of one or two characters, and stopwords
    .filter(lambda w: w and len(w) > 2 and w not in bc_stopwords.value)
    .map(lambda w: (w, 1))
    .reduceByKey(lambda a, b: a + b)
    # Keep only words that occur at least 5 times
    .filter(lambda x: x[1] >= 5)
    .sortBy(lambda x: x[1], ascending=False)
)

cleaned_counts.take(20)

Word Count Concepts Illustrated
This simple program demonstrates:
| Concept | Where in Word Count |
|---|---|
| Transformation | flatMap, map, filter |
| Action | take(), saveAsTextFile() |
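| Lazy evaluation | Transformations only build a plan; results are computed when an action such as take() runs (sortBy is a partial exception, since it samples the data eagerly to choose partition boundaries) |
| Shuffle | reduceByKey combines the counts for each word across partitions; sortBy also redistributes the data |
| Broadcast | Stopword set sent once to each executor instead of with every task |
| Lineage | word_counts ← sortBy ← reduceByKey ← map ← flatMap ← textFile |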
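The shuffle row is the least visible of these concepts in the code, so a small plain-Python simulation may help. It is a sketch of the general idea behind reduceByKey (combine within each partition, route each key to one reducer by hash, then merge), not Spark's actual implementation. The helper names `local_combine`, `shuffle`, and `reduce_by_key` are hypothetical, and `zlib.crc32` stands in for Spark's own partitioner as an assumption made for determinism.

```python
import zlib
from collections import Counter


def local_combine(partition):
    """Map-side combine: sum counts inside one partition before data moves."""
    return Counter(partition)


def shuffle(combined_partitions, num_reducers):
    """Send each (word, partial_count) to the reducer picked by a hash of the word."""
    buckets = [[] for _ in range(num_reducers)]
    for combined in combined_partitions:
        for word, partial in combined.items():
            # crc32 is an illustrative stand-in for a real partitioner
            target = zlib.crc32(word.encode("utf-8")) % num_reducers
            buckets[target].append((word, partial))
    return buckets


def reduce_by_key(partitions, num_reducers=2):
    """Return one dict of final counts per reducer partition."""
    combined = [local_combine(p) for p in partitions]
    reduced = []
    for bucket in shuffle(combined, num_reducers):
        totals = Counter()
        for word, partial in bucket:
            totals[word] += partial
        reduced.append(dict(totals))
    return reduced


partitions = [["spark", "word", "spark"], ["word", "count", "spark"]]
result = reduce_by_key(partitions)
```

Because every occurrence of a word hashes to the same reducer, each word ends up in exactly one output partition with its full count. Combining locally first means only one partial count per word per partition has to cross the network.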