Setting Up Apache Spark with PyCharm
Running PySpark locally in PyCharm gives you a fast development feedback loop — write and test Spark transformations on your laptop before submitting to a production cluster.
Prerequisites
- Java 11 or 17 (Spark 3.5 also runs on Java 8; Java 17 is recommended and is the version installed in the steps below)
- Python 3.9+
- PyCharm (Community or Professional)
- Apache Spark 3.5 (optional if using PySpark from pip)
Step 1: Install Java and Set JAVA_HOME
# macOS (Homebrew)
brew install openjdk@17
export JAVA_HOME=$(brew --prefix openjdk@17)/libexec/openjdk.jdk/Contents/Home

# Ubuntu/Debian
sudo apt-get install openjdk-17-jdk
export JAVA_HOME=/usr/lib/jvm/java-17-openjdk-amd64

# Windows (set in System Properties → Environment Variables)
# JAVA_HOME = C:\Program Files\Java\jdk-17

# Verify
java -version   # openjdk 17.x.x

Step 2: Create a PyCharm Project with Virtual Environment
- Open PyCharm → New Project
- Choose Pure Python project type
- Select New virtual environment using venv
- Click Create
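The same kind of environment can also be created outside the IDE with Python's standard `venv` module. The sketch below is one way to do it; the function name `create_project_venv` and the `.venv` directory name are illustrative assumptions, not something PyCharm requires.

```python
import venv
from pathlib import Path


def create_project_venv(project_dir: str, name: str = ".venv", with_pip: bool = True) -> Path:
    """Create a virtual environment inside project_dir and return its path."""
    target = Path(project_dir) / name
    venv.create(target, with_pip=with_pip)
    return target


if __name__ == "__main__":
    import tempfile

    # A throwaway directory keeps the demo from touching a real project.
    with tempfile.TemporaryDirectory() as tmp:
        env_dir = create_project_venv(tmp, with_pip=False)
        print("Created:", env_dir, (env_dir / "pyvenv.cfg").exists())
```

An environment created this way can then be selected in PyCharm as an existing interpreter instead of a new one.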
Step 3: Install PySpark
In PyCharm’s terminal (or via Settings → Python Packages):
pip install pyspark==3.5.1

This installs Spark itself, so no separate Spark download is needed for local development.
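To confirm that the package landed in the project's virtual environment rather than in a system interpreter, a quick check along these lines may help. It is a minimal sketch using only the standard library; the helper name `installed_pyspark_version` is hypothetical and not part of PySpark.

```python
import sys
from importlib import metadata
from typing import Optional


def installed_pyspark_version() -> Optional[str]:
    """Return the installed pyspark version, or None if it is not installed."""
    try:
        return metadata.version("pyspark")
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    # sys.executable shows which interpreter (and therefore which venv) is active.
    print("Interpreter:", sys.executable)
    print("pyspark version:", installed_pyspark_version() or "not installed")
```

If the interpreter path printed here is not inside the project's virtual environment, the PyCharm interpreter setting is the first thing to check.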
Step 4: Configure PyCharm Environment Variables
Go to Run → Edit Configurations → Environment Variables and add:
JAVA_HOME = /path/to/jdk17
SPARK_LOCAL_IP = 127.0.0.1
PYSPARK_PYTHON = /path/to/venv/bin/python

Step 5: Write Your First PySpark App
Create spark_app.py:
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder \
    .appName("PyCharm Spark") \
    .master("local[*]") \
    .config("spark.driver.memory", "2g") \
    .getOrCreate()

# Suppress verbose logging
spark.sparkContext.setLogLevel("WARN")

# Sample DataFrame
data = [
    ("Alice", "Engineering", 95000),
    ("Bob", "Marketing", 72000),
    ("Carol", "Engineering", 110000),
]
df = spark.createDataFrame(data, ["name", "department", "salary"])

result = df.groupBy("department").agg(
    F.avg("salary").alias("avg_salary"),
    F.count("*").alias("headcount")
)
result.show()

spark.stop()

Step 6: Run and Debug
Right-click spark_app.py → Run. The Spark UI is available at http://localhost:4040 while the app is running.
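Because the UI only exists while a session is alive, it can be handy to check whether anything is listening on the port before opening a browser. The following is a rough sketch using the standard library; `is_port_open` is a hypothetical helper, and an open port only shows that some process is listening there, not that it is Spark.

```python
import socket


def is_port_open(host: str = "127.0.0.1", port: int = 4040, timeout: float = 1.0) -> bool:
    """Return True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    # Spark tries 4040 first and moves on to 4041, 4042, ... if the port is busy.
    for port in range(4040, 4044):
        state = "open" if is_port_open(port=port) else "closed"
        print(f"localhost:{port} is {state}")
```

Inside a running application, `spark.sparkContext.uiWebUrl` reports the address the UI actually bound to, which is more reliable than probing ports.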
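For debugging, click the gutter next to a line to set a breakpoint, then use Debug instead of Run. Spark evaluates lazily: transformations such as groupBy only build a plan, and execution happens at action calls (.show(), .count(), .collect()). A breakpoint on a transformation line therefore shows an unevaluated DataFrame. To inspect actual rows, break after an action or evaluate one (for example result.collect()) in the debugger.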
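One way to make results visible in the debugger is to materialize a few rows into a plain Python list before the breakpoint. The sketch below assumes a small helper named `peek` (hypothetical, not a PySpark API) built on the real `DataFrame.limit` and `DataFrame.collect` methods; `demo` needs pyspark and Java, so it is defined but not called.

```python
from typing import Any, List


def peek(df: Any, n: int = 5) -> List[Any]:
    """Materialize at most n rows of a DataFrame on the driver.

    limit() is a transformation and collect() is the action that runs the job,
    so the returned list holds real rows that a debugger can display.
    """
    return df.limit(n).collect()


def demo() -> None:
    """Run peek() against a local session (requires pyspark and Java)."""
    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.master("local[*]").appName("peek-demo").getOrCreate()
    df = spark.createDataFrame([("Alice", 95000), ("Bob", 72000)], ["name", "salary"])

    # Transformation only: nothing has executed yet, `raised` is just a plan.
    raised = df.withColumn("salary", F.col("salary") * 1.05)

    rows = peek(raised)  # Set a breakpoint on the next line and inspect `rows`.
    print(rows)
    spark.stop()
```

Keeping `n` small matters here, since `collect()` pulls the rows into driver memory.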
Common Issues
| Issue | Fix |
|---|---|
| JAVA_HOME not set | Set JAVA_HOME in system env or Run Config |
| Python worker failed to connect | Set SPARK_LOCAL_IP=127.0.0.1 |
| Module not found: pyspark | Ensure correct venv is selected in PyCharm interpreter |
| Spark UI not loading | App may have finished; UI closes when SparkSession stops |
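Several of these checks can be automated before starting a session. The following is a minimal sketch, assuming a hypothetical helper named `diagnose` that inspects the environment variables from Step 4 and looks for the pyspark package; the warning texts are illustrative and are not Spark's own error messages.

```python
import os
from importlib import util
from typing import List, Mapping


def diagnose(env: Mapping[str, str] = os.environ) -> List[str]:
    """Return human-readable warnings for common local setup problems."""
    warnings = []

    java_home = env.get("JAVA_HOME")
    if not java_home:
        warnings.append("JAVA_HOME is not set")
    elif not os.path.isdir(java_home):
        warnings.append(f"JAVA_HOME points to a missing directory: {java_home}")

    # Only relevant if Python workers fail to connect on this machine.
    if env.get("SPARK_LOCAL_IP") != "127.0.0.1":
        warnings.append("SPARK_LOCAL_IP is not 127.0.0.1")

    # find_spec returns None when the active interpreter cannot see the package.
    if util.find_spec("pyspark") is None:
        warnings.append("pyspark is not importable from this interpreter")

    return warnings


if __name__ == "__main__":
    problems = diagnose()
    print("\n".join(problems) if problems else "No obvious setup problems found")
```

Running it with the same Run Configuration as spark_app.py shows the environment that the application will actually see.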