Spark for the word count program

Using Spark for the word count program offers several advantages. First, Spark distributes data across multiple nodes in a cluster and processes the partitions in parallel, which can substantially reduce processing time for large datasets. Second, Spark's resilient distributed datasets (RDDs) provide fault tolerance: Spark tracks how each RDD was derived, so partitions lost when a node fails can be recomputed and processing continues. This makes Spark a robust choice for large-scale word count tasks.
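
To make the idea of parallel processing concrete, here is a minimal plain-Python sketch (no Spark required) of the pattern described above: the data is cut into partitions, each partition is counted independently, and the partial counts are merged. The helper names split_into_partitions and count_partition and the sample lines are hypothetical and only illustrate the idea; Spark's real scheduling and shuffling are considerably more involved.

```python
from collections import Counter

def split_into_partitions(items, num_partitions):
    """Deal items round-robin into num_partitions lists (illustrative only)."""
    return [items[i::num_partitions] for i in range(num_partitions)]

def count_partition(lines):
    """Count words in one partition; this is the work a single node would do."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts

lines = [
    "spark makes word count simple",
    "word count is the classic spark example",
    "spark distributes the work",
]

partitions = split_into_partitions(lines, 2)

# Each partition is counted without looking at the others, so this step
# could run on separate nodes at the same time.
partial_counts = [count_partition(p) for p in partitions]

# Merging partial counts is associative, so the merge order does not matter.
total = sum(partial_counts, Counter())
print(total["spark"], total["word"], total["the"])
```

Because each partial count depends only on its own partition, a lost partial result can be rebuilt by recounting that one partition, which is roughly the intuition behind RDD fault tolerance.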

Program:

Step 1:

from pyspark.sql import SparkSession

# Create SparkSession
spark = SparkSession.builder \
  .master("local[1]") \
  .appName("NpBlue.com") \
  .getOrCreate()

Step 2:

lines_rdd = spark.sparkContext.parallelize([
"Using Spark for the word count program offers several advantages. Firstly, Spark's ability to distribute data across multiple nodes in a cluster allows for parallel processing, dramatically reducing the processing time for large datasets. ",
"Additionally, Spark's resilient distributed datasets (RDDs) enable fault tolerance, ensuring that processing continues even if a node fails. This makes Spark an ideal choice for handling large-scale word count tasks in a robust manner."
])

Step 3:

# Using flatMap to split lines into individual words
words_rdd = lines_rdd.flatMap(lambda line: line.split(" "))
# foreach runs on the executors; in local mode the output appears in the driver console
words_rdd.foreach(print)

Using
Spark
for
the
word
count
program
offers
several
advantages.
Firstly,
Spark's
ability
to
distribute
data
across
multiple
nodes
in
a
cluster
allows
... and so on
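
The difference between map and flatMap is easy to miss, so here is a small plain-Python sketch of the two behaviours. It is an analogy only, using list comprehensions and itertools.chain rather than Spark, and the two sample lines are made up for illustration.

```python
from itertools import chain

lines = ["Using Spark for", "the word count"]

# map-like: one output element per input line, so the result is a list of lists
mapped = [line.split(" ") for line in lines]

# flatMap-like: the per-line lists are concatenated into one flat sequence of words
flat_mapped = list(chain.from_iterable(line.split(" ") for line in lines))

print(mapped)
print(flat_mapped)
```

Word count needs the flat form, because the next step treats every word as its own record.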

Step 4:

# Using map to convert each word into a key-value pair, with the word as the key and count 1 as the value
word_count_pairs_rdd = words_rdd.map(lambda word: (word, 1))
word_count_pairs_rdd.foreach(print)


('Using', 1)
('Spark', 1)
('for', 1)
('the', 1)
('word', 1)
('count', 1)
('program', 1)
('offers', 1)
('several', 1)
('advantages.', 1)
('Firstly,', 1)
... and so on

Step 5:

# Using reduceByKey to sum the counts for each word
word_counts_rdd = word_count_pairs_rdd.reduceByKey(lambda x, y: x + y)
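
As a rough mental model of what reduceByKey computes, the following plain-Python sketch folds the values of each key together with the supplied function. The helper reduce_by_key is hypothetical and runs on a single machine; Spark performs the same kind of merge within each partition and then again across partitions, which is why the function should be associative and commutative.

```python
def reduce_by_key(pairs, func):
    """Combine the values of each key with func, like a local reduceByKey."""
    result = {}
    for key, value in pairs:
        if key in result:
            # Key seen before: merge the running value with the new one
            result[key] = func(result[key], value)
        else:
            # First occurrence of this key: start with its value
            result[key] = value
    return result

pairs = [("Spark", 1), ("for", 1), ("Spark", 1), ("for", 1), ("for", 1)]
counts = reduce_by_key(pairs, lambda x, y: x + y)
print(counts)
```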

Step 6:

# Collecting the word counts to the driver and displaying them
result = word_counts_rdd.collect()
for word, count in result:
    print(f"{word}: {count}")
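
The pairs returned by collect() may not arrive in any meaningful order, so it can help to sort them on the driver before printing. The sketch below is plain Python and assumes result is a list of (word, count) tuples like the one collected in this step; the five sample pairs are a small hand-picked subset used for illustration.

```python
# A small sample in the same shape as the collected result: (word, count) tuples
result = [("for", 4), ("a", 3), ("Spark", 2), ("Using", 1), ("the", 2)]

# Sort by count descending, then alphabetically to break ties
ranked = sorted(result, key=lambda pair: (-pair[1], pair[0]))

# Show the three most frequent words
for word, count in ranked[:3]:
    print(f"{word}: {count}")
```

Sorting on the driver is reasonable only when the collected result is small; for large vocabularies the sort would normally stay inside Spark.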

Final output:

Using: 1
Spark: 2
for: 4
the: 2
word: 2
count: 2
program: 1
offers: 1
several: 1
advantages.: 1
Firstly,: 1
Spark's: 2
ability: 1
to: 1
distribute: 1
data: 1
across: 1
multiple: 1
nodes: 1
in: 2
a: 3
cluster: 1
allows: 1
parallel: 1
processing,: 1
dramatically: 1
reducing: 1
processing: 2
time: 1
large: 1
datasets.: 1
: 1
Additionally,: 1
resilient: 1
distributed: 1
datasets: 1
(RDDs): 1
enable: 1
fault: 1
tolerance,: 1
ensuring: 1
that: 1
continues: 1
even: 1
if: 1
node: 1
fails.: 1
This: 1
makes: 1
an: 1
ideal: 1
choice: 1
handling: 1
large-scale: 1
tasks: 1
robust: 1
manner.: 1
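
Notice that the output treats "datasets." and "datasets" as different words, keeps punctuation such as "(RDDs)", and even counts an empty string once, because the first input line ends with a space and split(" ") does no cleanup. One possible remedy, shown here as a sketch rather than as part of the original program, is a small tokenizer that lowercases the text and keeps only word characters. The function name tokenize and its regular expression are assumptions chosen for this example.

```python
import re

def tokenize(line):
    """Lowercase a line and return its words, dropping punctuation and empty tokens.

    Apostrophes and hyphens inside a word are kept, so "Spark's" and
    "large-scale" each stay a single token.
    """
    return re.findall(r"[a-z0-9]+(?:['-][a-z0-9]+)*", line.lower())

sample = "Firstly, Spark's large-scale datasets (RDDs). "
print(tokenize(sample))
```

In Step 3 this function could take the place of the lambda, for example lines_rdd.flatMap(tokenize), after which the counts for variants like "datasets." and "datasets" would be merged.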
