What Are Transformations in Apache Spark?
In Apache Spark, transformations are operations that create a new Resilient Distributed Dataset (RDD) or DataFrame from an existing one. Unlike actions (which trigger computation), transformations are lazy—they don’t execute immediately but instead build a logical execution plan.
Key Characteristics of Transformations:
✔ Lazy Evaluation – Computations happen only when an action is called (see the sketch after this list).
✔ Immutable – Original RDD/DataFrame remains unchanged.
✔ Optimized Execution – Spark optimizes transformations before running them.
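These characteristics are easy to see in a tiny sketch (the app name and variable names below are illustrative, not from the article): the map call returns immediately without touching the data, and computation only happens when the collect() action runs.
from pyspark import SparkContext

sc = SparkContext("local", "LazyEvalSketch")  # hypothetical app name

data_rdd = sc.parallelize([1, 2, 3, 4])

# This line only records the transformation in the plan; nothing is computed yet.
doubled_rdd = data_rdd.map(lambda x: x * 2)

# The action below triggers the actual computation across the partitions.
print(doubled_rdd.collect())  # Output: [2, 4, 6, 8]

# Immutability: the original RDD is unchanged.
print(data_rdd.collect())  # Output: [1, 2, 3, 4]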
Why Are Transformations Important?
- Efficiency – Lazy evaluation avoids unnecessary computations.
- Fault Tolerance – Lineage (the dependency graph) lets Spark recompute lost partitions.
- Optimization – Spark’s Catalyst Optimizer improves query execution (see the inspection sketch after this list).
- Scalability – Works on distributed big data seamlessly.
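Both lineage and optimization can be inspected directly. The sketch below is purely illustrative (the app name and variable names are assumptions): toDebugString() prints the dependency chain Spark would replay to recover lost partitions, and explain() prints the plan the Catalyst Optimizer produces for a DataFrame query.
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("LineageSketch").getOrCreate()  # hypothetical app name
sc = spark.sparkContext

# Lineage: the debug string lists the chain of parent RDDs Spark can recompute from.
rdd = sc.parallelize(range(10)).map(lambda x: x + 1).filter(lambda x: x % 2 == 0)
print(rdd.toDebugString())  # PySpark returns the lineage as a bytes object

# Optimization: explain() shows the plan after Catalyst has optimized the query.
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "label"])
df.filter(df.id > 1).select("label").explain()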
Must-Know Spark Transformations
1. map() – Apply a Function to Each Element
- Processes each element of an RDD/DataFrame.
- Returns a new RDD/DataFrame with the transformed values.
Example 1: Convert Strings to Uppercase (RDD)
from pyspark import SparkContext
sc = SparkContext("local", "MapExample")
data = ["spark", "hadoop", "flink"]
rdd = sc.parallelize(data)

# Using map to uppercase each element
upper_rdd = rdd.map(lambda x: x.upper())
print(upper_rdd.collect())  # Output: ['SPARK', 'HADOOP', 'FLINK']
Example 2: Square Numbers (RDD)
numbers = [1, 2, 3, 4, 5]
num_rdd = sc.parallelize(numbers)

squared_rdd = num_rdd.map(lambda x: x ** 2)
print(squared_rdd.collect())  # Output: [1, 4, 9, 16, 25]
Example 3: Extract First Letter (DataFrame)
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("MapExampleDF").getOrCreate()

data = [("alice",), ("bob",), ("charlie",)]
df = spark.createDataFrame(data, ["name"])

df_transformed = df.select(col("name").substr(1, 1).alias("first_letter"))
df_transformed.show()
# Output:
# +------------+
# |first_letter|
# +------------+
# |           a|
# |           b|
# |           c|
# +------------+
2. filter() – Select Elements Based on a Condition
- Returns a new RDD/DataFrame with elements that meet a condition.
Example 1: Filter Even Numbers (RDD)
numbers = [1, 2, 3, 4, 5, 6]
num_rdd = sc.parallelize(numbers)

even_rdd = num_rdd.filter(lambda x: x % 2 == 0)
print(even_rdd.collect())  # Output: [2, 4, 6]
Example 2: Filter Names Starting with ‘A’ (RDD)
names = ["alice", "bob", "anna", "dave"]
names_rdd = sc.parallelize(names)

filtered_names = names_rdd.filter(lambda x: x.startswith('a'))
print(filtered_names.collect())  # Output: ['alice', 'anna']
Example 3: Filter DataFrame Rows (Salary > 50000)
data = [("alice", 60000), ("bob", 45000), ("charlie", 70000)]df = spark.createDataFrame(data, ["name", "salary"])
filtered_df = df.filter(df.salary > 50000)filtered_df.show()# Output:# +-------+------+# | name|salary|# +-------+------+# | alice| 60000|# |charlie| 70000|# +-------+------+
3. flatMap() – Transform and Flatten Results
- Applies a function to each element and flattens the results (unlike map, which produces exactly one output element per input element).
Example 1: Split Sentences into Words (RDD)
sentences = ["Hello world", "Apache Spark"]
sent_rdd = sc.parallelize(sentences)

words_rdd = sent_rdd.flatMap(lambda x: x.split(" "))
print(words_rdd.collect())  # Output: ['Hello', 'world', 'Apache', 'Spark']
Example 2: Generate Pairs from Numbers (RDD)
numbers = [1, 2, 3]
num_rdd = sc.parallelize(numbers)

pairs_rdd = num_rdd.flatMap(lambda x: [(x, x*1), (x, x*2)])
print(pairs_rdd.collect())
# Output: [(1, 1), (1, 2), (2, 2), (2, 4), (3, 3), (3, 6)]
Example 3: Explode Array Column (DataFrame)
from pyspark.sql.functions import explode
data = [("alice", ["java", "python"]), ("bob", ["scala"])]df = spark.createDataFrame(data, ["name", "skills"])
exploded_df = df.select("name", explode("skills").alias("skill"))exploded_df.show()# Output:# +-----+------+# | name| skill|# +-----+------+# |alice| java|# |alice|python|# | bob| scala|# +-----+------+
How to Remember Transformations for Interviews & Exams
- Lazy vs. Eager – Transformations are lazy, actions trigger execution.
- Common Transformations – map, filter, flatMap, groupBy, join.
- Think in Stages – Each transformation builds a step in the execution plan (see the sketch after this list).
- Practice with Examples – Write small Spark jobs to reinforce concepts.
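For instance, a small word-count-style job chains several transformations and only runs when the final action is called; the reduceByKey step introduces a shuffle, which is where Spark splits the work into stages. This is a minimal sketch that assumes the local SparkContext sc created in the earlier examples.
lines = sc.parallelize(["spark makes big data simple", "big data needs spark"])

# Each transformation below only adds a step to the logical plan; nothing runs yet.
words = lines.flatMap(lambda line: line.split(" "))   # split lines into individual words
pairs = words.map(lambda word: (word, 1))             # pair each word with an initial count
counts = pairs.reduceByKey(lambda a, b: a + b)        # shuffle and sum the counts per word

# The action triggers the whole pipeline; the shuffle marks a stage boundary.
print(counts.collect())  # e.g. [('spark', 2), ('big', 2), ('data', 2), ...]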
Conclusion
Spark transformations (map, filter, flatMap, etc.) are fundamental for efficient big data processing. They enable lazy evaluation, fault tolerance, and optimized execution.
Key Takeaways:
✅ Use map() for element-wise transformations.
✅ Apply filter() to select data conditionally.
✅ flatMap() helps flatten nested structures.
✅ Always remember: Transformations are lazy until an action is called!
By mastering these concepts, you’ll be well-prepared for Spark interviews, exams, and real-world big data projects!