
Apache Spark 49 guides · updated 2026

Distributed data processing at scale — RDDs, DataFrames, Structured Streaming, and the tuning techniques that keep Spark jobs fast and cheap.

Setting Up Apache Spark with PyCharm

Running PySpark locally in PyCharm gives you a fast development feedback loop — write and test Spark transformations on your laptop before submitting to a production cluster.


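Prerequisites

You need PyCharm, a Python 3 interpreter, and a JDK on your machine. The steps below use OpenJDK 17 and PySpark 3.5.1.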


Step 1: Install Java and Set JAVA_HOME

# macOS (Homebrew)
brew install openjdk@17
export JAVA_HOME=$(brew --prefix openjdk@17)/libexec/openjdk.jdk/Contents/Home
# Ubuntu/Debian
sudo apt-get install openjdk-17-jdk
export JAVA_HOME=/usr/lib/jvm/java-17-openjdk-amd64  # amd64 path; the directory name differs on other architectures
# Windows (set in System Properties → Environment Variables)
# JAVA_HOME = C:\Program Files\Java\jdk-17
# Verify
java -version # openjdk 17.x.x
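If you prefer to check the Java setup from Python (for example from the PyCharm console), a minimal sketch could look like the following. The helper name `java_home_status` is made up for this guide; it only checks that JAVA_HOME is set and that a `java` launcher exists under its `bin` directory, not that the JDK version is one Spark supports.

```python
import os
from pathlib import Path


def java_home_status(env=None):
    """Classify JAVA_HOME as 'missing', 'invalid', or 'ok'.

    `env` defaults to the current process environment; passing a dict
    makes the check easy to exercise without touching real settings.
    """
    env = os.environ if env is None else env
    java_home = env.get("JAVA_HOME")
    if not java_home:
        return "missing"
    bin_dir = Path(java_home) / "bin"
    # The launcher is java.exe on Windows and java elsewhere.
    if (bin_dir / "java").is_file() or (bin_dir / "java.exe").is_file():
        return "ok"
    return "invalid"


print("JAVA_HOME status:", java_home_status())
```

A result of "invalid" usually means JAVA_HOME points at the wrong directory level, such as the JDK's parent folder.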

Step 2: Create a PyCharm Project with Virtual Environment

  1. Open PyCharm → New Project
  2. Choose Pure Python project type
  3. Select New virtual environment using venv
  4. Click Create
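To confirm that PyCharm is actually running your code with the new virtual environment, a small check like this can be run from the project. It is a sketch that relies on the standard `sys.prefix` / `sys.base_prefix` distinction used by venv; the function name `in_virtualenv` is just illustrative.

```python
import sys


def in_virtualenv():
    """Return True when running inside a venv-style virtual environment."""
    # In a venv, sys.prefix points at the environment while
    # sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


print("Interpreter:", sys.executable)
print("Inside a virtual environment:", in_virtualenv())
```

If the printed interpreter path is not inside the project's venv folder, the project interpreter setting in PyCharm is the first thing to check.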

Step 3: Install PySpark

In PyCharm’s terminal (or via Settings → Python Packages):

pip install pyspark==3.5.1

The pip package bundles the Spark runtime itself, so local development needs no separate Spark download.
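One way to verify the install from the same interpreter PyCharm uses is to ask the package metadata for the version. This is a minimal sketch using the standard library's `importlib.metadata`; the helper name `installed_version` is an assumption of this example.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of `package`, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


print("pyspark version:", installed_version("pyspark"))
```

A result of None suggests that pip installed into a different environment than the one the script is running in.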


Step 4: Configure PyCharm Environment Variables

Go to Run → Edit Configurations → Environment Variables and add:

JAVA_HOME = /path/to/jdk17
SPARK_LOCAL_IP = 127.0.0.1
PYSPARK_PYTHON = /path/to/venv/bin/python
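As an alternative to the Run Configuration, the same variables can be set at the top of the script. The sketch below assumes it runs before the first SparkSession is created, since the JVM and the Python workers read these values at startup. JAVA_HOME is only reported here, not set, because its path is machine-specific.

```python
import os
import sys

# Assumption: this runs before the first SparkSession is created, because
# the JVM and Python workers read these variables at startup.
os.environ.setdefault("SPARK_LOCAL_IP", "127.0.0.1")  # keep any value already set
os.environ["PYSPARK_PYTHON"] = sys.executable  # workers use the current interpreter

for name in ("JAVA_HOME", "SPARK_LOCAL_IP", "PYSPARK_PYTHON"):
    print(f"{name} = {os.environ.get(name, '<not set>')}")
```

Using `sys.executable` avoids hard-coding the venv path, at the cost of tying the setting to whichever interpreter launches the script.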

Step 5: Write Your First PySpark App

Create spark_app.py:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder \
    .appName("PyCharm Spark") \
    .master("local[*]") \
    .config("spark.driver.memory", "2g") \
    .getOrCreate()

# Suppress verbose logging
spark.sparkContext.setLogLevel("WARN")

# Sample DataFrame
data = [
    ("Alice", "Engineering", 95000),
    ("Bob", "Marketing", 72000),
    ("Carol", "Engineering", 110000),
]
df = spark.createDataFrame(data, ["name", "department", "salary"])

result = df.groupBy("department").agg(
    F.avg("salary").alias("avg_salary"),
    F.count("*").alias("headcount")
)
result.show()

spark.stop()
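To know what `result.show()` should print, the same aggregation can be worked out in plain Python on the sample rows. This is only a cross-check of the arithmetic, not how Spark computes it, and Spark does not guarantee the row order of the grouped output.

```python
from collections import defaultdict

data = [
    ("Alice", "Engineering", 95000),
    ("Bob", "Marketing", 72000),
    ("Carol", "Engineering", 110000),
]

# Group salaries by department, mirroring groupBy("department").
salaries = defaultdict(list)
for _name, department, salary in data:
    salaries[department].append(salary)

# avg_salary and headcount per department, mirroring the agg() call.
expected = {
    dept: {"avg_salary": sum(vals) / len(vals), "headcount": len(vals)}
    for dept, vals in salaries.items()
}
print(expected)
```

Engineering should come out with an average salary of 102500.0 and a headcount of 2, and Marketing with 72000.0 and 1.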

Step 6: Run and Debug

Right-click spark_app.py → Run. The Spark UI is available at http://localhost:4040 while the app is running.

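For debugging, click the gutter next to a line to set a breakpoint, then use Debug instead of Run. Spark evaluates lazily: transformations such as groupBy() only build a query plan, and execution happens at action calls (.show(), .count(), .collect()). A breakpoint on a transformation line therefore shows an unevaluated DataFrame, so set breakpoints after an action to inspect actual results.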
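Because `.show()` only prints and returns nothing, it can help to pull a few rows to the driver so the debugger has real values to display. The helper below is a hypothetical convenience for this guide, not part of PySpark; it relies only on `DataFrame.limit()`, `DataFrame.collect()`, and `Row.asDict()`.

```python
def materialize(df, limit=20):
    """Bring up to `limit` rows to the driver as a list of plain dicts.

    `df` is expected to be a pyspark.sql.DataFrame; limit(), collect(), and
    Row.asDict() are the only calls used.
    """
    # collect() is an action, so Spark executes the plan here.
    return [row.asDict() for row in df.limit(limit).collect()]
```

In spark_app.py, a line such as `rows = materialize(result)` placed before `spark.stop()` gives a breakpoint on the following line a concrete list to inspect. Keep the limit small, since collected rows live in driver memory.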


Common Issues

| Issue | Fix |
| --- | --- |
| JAVA_HOME not set | Set JAVA_HOME in system env or Run Config |
| Python worker failed to connect | Set SPARK_LOCAL_IP=127.0.0.1 |
| Module not found: pyspark | Ensure correct venv is selected in PyCharm interpreter |
| Spark UI not loading | App may have finished; UI closes when SparkSession stops |
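For the Spark UI case, a quick way to tell "the app has already finished" apart from a browser problem is to test whether anything is listening on the UI port. This is a minimal sketch using the standard `socket` module; the helper name `port_open` is illustrative, and 4040 is the default port mentioned above.

```python
import socket


def port_open(host="localhost", port=4040, timeout=1.0):
    """Return True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


print("Spark UI reachable on 4040:", port_open())
```

If 4040 is already taken by another Spark application, Spark typically binds the UI to the next free port (4041, 4042, and so on), so it can be worth checking those as well.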