Google Cloud Dataproc: Managed Spark and Hadoop That Spins Up in 90 Seconds
Before Dataproc, moving a Hadoop or Spark workload to the cloud meant provisioning VMs, installing Java, configuring a Hadoop distribution, setting up YARN, mounting storage, and repeating all of it every time you needed a cluster of a different size. Dataproc compresses that process to a single API call. A cluster with Spark, Hadoop, and Hive preinstalled, plus add-ons such as JupyterLab and Hue if you request them, is ready to receive jobs in about 90 seconds.
The key operational shift with Dataproc is treating clusters as ephemeral rather than permanent. Create a cluster for a job, run the job, delete the cluster. Store all data in Cloud Storage, not HDFS. Pay for compute only when jobs are running.
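The create, run, delete cycle can be sketched as a small Python wrapper around the gcloud CLI. This is a minimal illustration, not a production orchestrator: the function names and the `runner` parameter are hypothetical, and it assumes gcloud is installed and authenticated. The commands reuse only the subcommands and flags that appear elsewhere in this article.

```python
import subprocess
from typing import Callable, Dict, List, Optional


def build_commands(cluster: str, region: str, job_uri: str) -> Dict[str, List[str]]:
    """Return the three gcloud invocations for one ephemeral run."""
    return {
        "create": ["gcloud", "dataproc", "clusters", "create", cluster,
                   f"--region={region}"],
        "submit": ["gcloud", "dataproc", "jobs", "submit", "pyspark", job_uri,
                   f"--cluster={cluster}", f"--region={region}"],
        "delete": ["gcloud", "dataproc", "clusters", "delete", cluster,
                   f"--region={region}", "--quiet"],
    }


def run_ephemeral_job(cluster: str, region: str, job_uri: str,
                      runner: Optional[Callable[[List[str]], object]] = None) -> None:
    """Create a cluster, run one job, and always delete the cluster."""
    if runner is None:
        # Default: execute the command and raise if it exits non-zero.
        runner = lambda cmd: subprocess.run(cmd, check=True)
    cmds = build_commands(cluster, region, job_uri)
    runner(cmds["create"])
    try:
        runner(cmds["submit"])
    finally:
        # Delete even if the job fails, so a failed run does not
        # leave a billed cluster behind.
        runner(cmds["delete"])
```

Putting the delete step in a `finally` block is the design choice that matters here: the cluster is removed whether the job succeeds or raises.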
Architecture: What a Dataproc Cluster Contains
Dataproc Cluster
├── Master node(s)
│   ├── YARN ResourceManager
│   ├── HDFS NameNode (rarely used; data lives in GCS)
│   ├── Spark Driver (for interactive jobs)
│   └── Web UIs (Spark History Server, YARN RM, Hue)
│
├── Primary worker nodes (N)
│   ├── YARN NodeManager
│   ├── HDFS DataNode
│   └── Spark Executors
│
└── Secondary (preemptible) worker nodes (optional)
    ├── YARN NodeManager
    └── Spark Executors (tasks only, no HDFS)

GCS acts as the data lake for both input and output.
HDFS is used only for shuffle and temporary files.

Single-node cluster (master only): for development and testing.
Standard cluster (1 master + N workers): for most production workloads. YARN reschedules tasks when a worker fails, but the single master remains a single point of failure.
High-availability cluster (3 masters + N workers): master nodes use ZooKeeper for HA. Use when cluster availability during maintenance is critical.
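As a rough illustration, the three cluster modes map onto a handful of `gcloud dataproc clusters create` flags. The helper below is a hypothetical sketch (the function name and mode strings are invented for this example); it assumes the `--single-node`, `--num-masters`, and `--num-workers` flags of the gcloud CLI.

```python
from typing import List


def cluster_mode_flags(mode: str, num_workers: int = 2) -> List[str]:
    """Translate a cluster mode name into gcloud create flags."""
    if mode == "single-node":
        # Master only: no separate workers are created.
        return ["--single-node"]
    if mode == "standard":
        return ["--num-masters=1", f"--num-workers={num_workers}"]
    if mode == "high-availability":
        # Three masters coordinated through ZooKeeper.
        return ["--num-masters=3", f"--num-workers={num_workers}"]
    raise ValueError(f"unknown cluster mode: {mode}")


# Example: flags for an HA cluster with five workers.
print(" ".join(cluster_mode_flags("high-availability", num_workers=5)))
```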
Ephemeral Clusters: The Cost-Effective Pattern
A common mistake is to run Dataproc clusters 24/7, the way you would run an on-premises Hadoop deployment. The cloud pattern is different:
Traditional pattern (expensive):
  Cluster: 10 nodes, runs 24/7
  Cost: ~$1,200/month
  Utilization: 20% (jobs run for ~5 hours/day)

Ephemeral pattern (cost-efficient):
  Cluster created: 5 minutes before job starts
  Job runs: 4 hours
  Cluster deleted: immediately after job completes
  Cost per day: ~$6 for 4 hours × 10 nodes
  Monthly cost: ~$180
  Savings: 85%

Store all data in Cloud Storage:
    # Do NOT put data in HDFS (lost when cluster is deleted)
    # Use gs:// paths for all input and output
    gcloud dataproc jobs submit pyspark gs://my-bucket/jobs/daily_etl.py \
      --cluster=etl-cluster \
      --region=us-central1 \
      -- \
      --input=gs://my-bucket/raw/2025/03/15/ \
      --output=gs://my-bucket/processed/2025/03/15/

Creating and Deleting Clusters Programmatically
    # Create a cluster with preemptible secondary workers
    gcloud dataproc clusters create etl-cluster \
      --region=us-central1 \
      --master-machine-type=n2-standard-4 \
      --master-boot-disk-size=50GB \
      --num-workers=5 \
      --worker-machine-type=n2-standard-8 \
      --worker-boot-disk-size=100GB \
      --num-secondary-workers=10 \
      --secondary-worker-type=preemptible \
      --image-version=2.1-debian11 \
      --optional-components=JUPYTER,HIVE_WEBHCAT \
      --max-idle=1h \
      --max-age=24h

    # Submit the job
    gcloud dataproc jobs submit pyspark gs://my-bucket/jobs/etl.py \
      --cluster=etl-cluster \
      --region=us-central1

    # Delete when done (or rely on --max-idle)
    gcloud dataproc clusters delete etl-cluster --region=us-central1 --quiet

--max-idle=1h automatically deletes the cluster after 1 hour of inactivity. --max-age=24h deletes it after 24 hours regardless of activity. These safeguards prevent forgotten clusters from running indefinitely.
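To see how the two limits interact, here is a simplified model in Python. It is only a sketch of the rule described above (whichever limit is reached first wins), not Dataproc's actual scheduler; the function name and the `idle_since` parameter are hypothetical.

```python
from datetime import datetime, timedelta
from typing import Optional


def scheduled_deletion_time(created_at: datetime,
                            idle_since: Optional[datetime],
                            max_idle: timedelta = timedelta(hours=1),
                            max_age: timedelta = timedelta(hours=24)) -> datetime:
    """Return the earliest moment at which either limit deletes the cluster."""
    age_deadline = created_at + max_age
    if idle_since is None:
        # A job is still running, so only the age limit applies.
        return age_deadline
    return min(age_deadline, idle_since + max_idle)


created = datetime(2025, 3, 15, 0, 0)
# Last job finished at 04:00, so the idle limit fires first.
print(scheduled_deletion_time(created, datetime(2025, 3, 15, 4, 0)))
# Last job finished at 23:30, so the age limit fires first.
print(scheduled_deletion_time(created, datetime(2025, 3, 15, 23, 30)))
```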
Autoscaling Policies
For long-running clusters where job load fluctuates, autoscaling adds and removes workers automatically.
    # Create an autoscaling policy
    gcloud dataproc autoscaling-policies import my-autoscaling-policy \
      --region=us-central1 \
      --source=- << 'EOF'
    workerConfig:
      minInstances: 2
      maxInstances: 20
      weight: 1
    secondaryWorkerConfig:
      minInstances: 0
      maxInstances: 50
      weight: 1
    basicAlgorithm:
      cooldownPeriod: 4m
      yarnConfig:
        scaleUpFactor: 1.0
        scaleDownFactor: 1.0
        scaleUpMinWorkerFraction: 0.0
        scaleDownMinWorkerFraction: 0.0
        gracefulDecommissionTimeout: 1h
    EOF

    # Attach to cluster
    gcloud dataproc clusters create autoscaling-cluster \
      --region=us-central1 \
      --autoscaling-policy=my-autoscaling-policy

The autoscaler evaluates YARN memory metrics once per cooldown period. When jobs are queued waiting for resources (pending memory), it adds workers. When memory sits unused (available memory), it removes workers, waiting up to gracefulDecommissionTimeout for running work to finish.
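The policy fields can be read as inputs to a simple calculation. The Python sketch below is a simplified model of that calculation, not Dataproc's implementation. The function name, the rounding choices (round up when adding, round down when removing), and the single combined min/max bound are assumptions made for illustration.

```python
import math


def worker_delta(avg_pending_mb: float, avg_available_mb: float,
                 worker_memory_mb: float, current_workers: int,
                 min_workers: int, max_workers: int,
                 scale_up_factor: float = 1.0,
                 scale_down_factor: float = 1.0,
                 scale_up_min_fraction: float = 0.0,
                 scale_down_min_fraction: float = 0.0) -> int:
    """Return how many workers to add (positive) or remove (negative)."""
    if avg_pending_mb > 0:
        # Queued work: add enough workers to cover a fraction of pending memory.
        delta = math.ceil(avg_pending_mb * scale_up_factor / worker_memory_mb)
        min_fraction = scale_up_min_fraction
    elif avg_available_mb > 0:
        # Idle capacity: remove workers covering a fraction of available memory.
        delta = -math.floor(avg_available_mb * scale_down_factor / worker_memory_mb)
        min_fraction = scale_down_min_fraction
    else:
        return 0
    # Skip changes smaller than the configured fraction of the cluster size.
    if abs(delta) < min_fraction * current_workers:
        return 0
    # Clamp the result to the policy's instance bounds.
    target = max(min_workers, min(max_workers, current_workers + delta))
    return target - current_workers


# 48 GB pending on 24 GB workers with scaleUpFactor 1.0: add two workers.
print(worker_delta(48_000, 0, 24_000, current_workers=10,
                   min_workers=2, max_workers=70))
```

With both factors at 1.0, as in the policy above, the model tries to clear all pending memory in one step and to release all idle memory in one step. Lower factors would make scaling more gradual.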
Running Spark Jobs
    # PySpark job stored in GCS: gs://my-bucket/jobs/sales_report.py
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("SalesReport").getOrCreate()

    # Read from GCS (not HDFS)
    orders = spark.read.parquet("gs://my-bucket/raw/orders/")
    customers = spark.read.parquet("gs://my-bucket/raw/customers/")

    # Transform
    revenue_by_region = (
        orders.join(customers, orders.customer_id == customers.id)
        .groupBy("region")
        .agg(
            F.sum("amount").alias("total_revenue"),
            F.count("order_id").alias("order_count"),
            F.avg("amount").alias("avg_order_value"),
        )
        .orderBy(F.desc("total_revenue"))
    )

    # Write results to GCS
    revenue_by_region.write.parquet(
        "gs://my-bucket/reports/revenue_by_region/",
        mode="overwrite",
    )

    spark.stop()

Submit this job:

    gcloud dataproc jobs submit pyspark gs://my-bucket/jobs/sales_report.py \
      --cluster=etl-cluster \
      --region=us-central1 \
      --properties=spark.sql.shuffle.partitions=200

Hive and Presto for SQL Users
Not every team wants to write Spark code. Dataproc ships with Hive, and Presto (Trino on newer image versions) is available as an optional component, so SQL users can query data in GCS directly.
    # Submit a Hive job
    gcloud dataproc jobs submit hive \
      --cluster=etl-cluster \
      --region=us-central1 \
      --execute='
        CREATE EXTERNAL TABLE IF NOT EXISTS orders (
          order_id STRING,
          customer_id STRING,
          amount DOUBLE,
          order_date STRING
        )
        STORED AS PARQUET
        LOCATION "gs://my-bucket/raw/orders/";

        SELECT
          DATE_FORMAT(order_date, "yyyy-MM") AS month,
          SUM(amount) AS revenue,
          COUNT(*) AS order_count
        FROM orders
        WHERE order_date >= "2025-01-01"
        GROUP BY DATE_FORMAT(order_date, "yyyy-MM")
        ORDER BY month;
      '

The query groups by the full expression rather than by column position, because Hive does not enable positional GROUP BY aliases by default. Hive translates SQL to MapReduce or Tez jobs. Presto executes SQL directly in memory across workers, which is typically much faster for interactive queries.
Dataproc Serverless: No Cluster Management
Dataproc Serverless lets you submit Spark batch or notebook jobs without creating or managing a cluster. You submit the job, and Dataproc allocates workers, runs the job, and releases the resources automatically.
    gcloud dataproc batches submit pyspark gs://my-bucket/jobs/etl.py \
      --region=us-central1 \
      --deps-bucket=gs://my-bucket/deps \
      --version=2.1 \
      --properties=spark.executor.cores=4,spark.executor.memory=8g

Dataproc Serverless is simpler operationally but has constraints:
- No Hadoop, Hive, or Presto (Spark only)
- Cold start adds 2-3 minutes before the first task runs
- Limited configuration compared to managed clusters
- Cannot attach to an existing cluster’s HDFS
For jobs that need Spark and want zero cluster management, Dataproc Serverless is the right choice. For jobs requiring Hadoop ecosystem tools or long-running clusters, managed Dataproc clusters remain the better option.
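The rule of thumb above can be written down as a small decision helper. This is only an illustrative sketch of the criteria listed in this section. The function name, parameters, and return values are hypothetical, and real decisions will also weigh cost and startup latency.

```python
from typing import List, Tuple


def choose_dataproc_mode(needs_hadoop_tools: bool = False,
                         long_running: bool = False,
                         needs_cluster_hdfs: bool = False) -> Tuple[str, List[str]]:
    """Pick 'serverless' or 'managed cluster' and explain why."""
    reasons = []
    if needs_hadoop_tools:
        reasons.append("needs Hadoop, Hive, or Presto")
    if long_running:
        reasons.append("needs a long-running cluster")
    if needs_cluster_hdfs:
        reasons.append("needs an existing cluster's HDFS")
    if reasons:
        # Any one of these requirements rules out Dataproc Serverless.
        return "managed cluster", reasons
    return "serverless", ["Spark-only job with no cluster dependencies"]


# A nightly Spark ETL job with no Hive or HDFS dependency.
print(choose_dataproc_mode())
# A workload that also runs Hive queries.
print(choose_dataproc_mode(needs_hadoop_tools=True))
```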
Initialization Actions: Customizing Cluster Setup
Run scripts on all cluster nodes at creation time to install libraries or configure software:
    gcloud dataproc clusters create custom-cluster \
      --region=us-central1 \
      --initialization-actions=gs://my-bucket/init/install-deps.sh

    # install-deps.sh content:
    # #!/bin/bash
    # pip install scikit-learn pandas boto3
    # apt-get install -y libgomp1

Dataproc vs Dataflow: Side by Side
Dataflow (Apache Beam):
+ Serverless (no cluster provisioning)
+ Unified batch/stream model
+ Fully managed autoscaling
- Must write in Beam SDK (different from Spark)
- Harder to migrate existing Spark code

Dataproc (Spark/Hadoop):
+ Runs existing Spark, Hadoop, Hive, Pig code unchanged
+ Interactive notebooks (JupyterLab on cluster)
+ Richer ecosystem (Presto, HBase, Oozie, etc.)
- Need to manage/monitor clusters
- Streaming requires Spark Structured Streaming

Summary
Dataproc’s core advantage is running open-source big data frameworks (Spark, Hadoop, Hive, Presto) as a managed GCP service with little configuration overhead. The ephemeral cluster pattern cuts costs sharply compared with permanent clusters, by about 85% in the example above. Storing data in GCS rather than HDFS is what makes clusters disposable. Autoscaling handles variable job loads. Dataproc Serverless removes cluster management entirely for Spark-only workloads. The decision between Dataproc and Dataflow comes down to existing code and workload type: Dataproc for Spark/Hadoop ecosystems, Dataflow for new streaming pipelines and ETL built on the Apache Beam model.