Google Cloud Dataproc: Managed Spark and Hadoop That Spins Up in 90 Seconds

Before Dataproc, moving a Hadoop or Spark workload to the cloud required provisioning VMs, installing Java, configuring Hadoop distributions, setting up YARN, mounting storage, and doing this again every time you needed a cluster of a different size. Dataproc compresses that process to a single API call. A cluster with Spark, Hadoop, Hive, Hue, and JupyterLab ready to receive jobs spins up in about 90 seconds.

The key operational shift with Dataproc is treating clusters as ephemeral rather than permanent. Create a cluster for a job, run the job, delete the cluster. Store all data in Cloud Storage, not HDFS. Pay for compute only when jobs are running.


Architecture: What a Dataproc Cluster Contains

Dataproc Cluster
├── Master node(s)
│   ├── YARN ResourceManager
│   ├── HDFS NameNode (rarely used; data lives in GCS)
│   ├── Spark Driver (for interactive jobs)
│   └── Web UIs (Spark History Server, YARN RM, Hue)
├── Primary worker nodes (N)
│   ├── YARN NodeManager
│   ├── HDFS DataNode
│   └── Spark Executors
└── Secondary (preemptible) worker nodes (optional)
    ├── YARN NodeManager
    └── Spark Executors (tasks only, no HDFS)

GCS acts as the data lake for both input and output.
HDFS is used only for shuffle and temporary files.

Single-node cluster (master only): for development and testing.

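Standard cluster (1 master + N workers): for most production workloads. The single master is not highly available, so resilience comes at the job level: a failed job is retried or rerun on a fresh cluster.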

High-availability cluster (3 masters + N workers): master nodes use ZooKeeper for HA. Use when cluster availability during maintenance is critical.
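
As a rough sketch, the three cluster shapes above map onto `gcloud dataproc clusters create` flags along these lines. The function name and the mode labels are invented for illustration; `--single-node`, `--num-masters`, and `--num-workers` are existing gcloud flags, but check the current CLI reference before relying on the exact combination.

```python
from typing import List


def cluster_shape_flags(mode: str, num_workers: int = 2) -> List[str]:
    """Return the gcloud flags that select a Dataproc cluster shape.

    `mode` is a label made up for this sketch: "single-node",
    "standard", or "high-availability".
    """
    if mode == "single-node":
        # One VM plays both the master and the worker role.
        return ["--single-node"]
    if mode == "standard":
        # One master (the default) plus N primary workers.
        return [f"--num-workers={num_workers}"]
    if mode == "high-availability":
        # Three masters, coordinated through ZooKeeper.
        return ["--num-masters=3", f"--num-workers={num_workers}"]
    raise ValueError(f"unknown cluster mode: {mode!r}")


if __name__ == "__main__":
    for mode in ("single-node", "standard", "high-availability"):
        print(mode, cluster_shape_flags(mode, num_workers=5))
```

The returned list is meant to be appended to the rest of a `clusters create` command rather than run on its own.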


Ephemeral Clusters: The Cost-Effective Pattern

A common mistake is to run Dataproc clusters 24/7, as if they were on-premises Hadoop deployments. The GCP pattern is different:

Traditional pattern (expensive):
  Cluster: 10 nodes, runs 24/7
  Cost: ~$1,200/month
  Utilization: 20% (jobs run for ~5 hours/day)

Ephemeral pattern (cost-efficient):
  Cluster created: 5 minutes before job starts
  Job runs: 4 hours
  Cluster deleted: immediately after job completes
  Cost per day: ~$6 for 4 hours × 10 nodes
  Monthly cost: ~$180
  Savings: 85%
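
The arithmetic behind this comparison can be reproduced with a few lines of Python. This is a back-of-the-envelope sketch, not a pricing calculator: it assumes a 30-day month, derives an implied node-hour rate from the ~$1,200/month always-on figure, and ignores startup time, storage, and the per-vCPU Dataproc fee. With those assumptions it lands at roughly $200/month and about 83% savings, in line with the rounded figures above.

```python
HOURS_PER_DAY = 24
DAYS_PER_MONTH = 30  # assumption: a 30-day billing month


def node_hour_rate(monthly_cost: float, nodes: int) -> float:
    """Implied price of one node-hour, derived from an always-on bill."""
    return monthly_cost / (nodes * HOURS_PER_DAY * DAYS_PER_MONTH)


def ephemeral_monthly_cost(rate: float, nodes: int, hours_per_day: float) -> float:
    """Monthly cost when the cluster only exists while the job runs."""
    return rate * nodes * hours_per_day * DAYS_PER_MONTH


def savings_fraction(always_on: float, ephemeral: float) -> float:
    """Fraction of the always-on bill that the ephemeral pattern avoids."""
    return 1.0 - ephemeral / always_on


if __name__ == "__main__":
    always_on = 1200.0  # figure taken from the comparison above
    rate = node_hour_rate(always_on, nodes=10)
    ephemeral = ephemeral_monthly_cost(rate, nodes=10, hours_per_day=4)
    print(f"node-hour rate: ${rate:.3f}")
    print(f"ephemeral monthly cost: ${ephemeral:.0f}")
    print(f"savings: {savings_fraction(always_on, ephemeral):.0%}")
```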

Store all data in Cloud Storage:

# Do NOT put data in HDFS (lost when cluster is deleted)
# Use gs:// paths for all input and output
gcloud dataproc jobs submit pyspark gs://my-bucket/jobs/daily_etl.py \
  --cluster=etl-cluster \
  --region=us-central1 \
  -- \
  --input=gs://my-bucket/raw/2025/03/15/ \
  --output=gs://my-bucket/processed/2025/03/15/
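
A scheduler that launches this job every day needs to generate the dated paths. The following is a minimal sketch of one way to do that while refusing anything that is not a `gs://` URI. The helper names and the `raw/` and `processed/` prefixes mirror the example above but are otherwise assumptions.

```python
from datetime import date
from typing import Dict, List


def require_gcs(uri: str) -> str:
    """Reject non-GCS URIs so data never lands on cluster-local HDFS."""
    if not uri.startswith("gs://"):
        raise ValueError(f"expected a gs:// URI, got {uri!r}")
    return uri


def daily_paths(bucket: str, day: date) -> Dict[str, str]:
    """Build date-partitioned input and output locations for one run."""
    partition = day.strftime("%Y/%m/%d")
    return {
        "input": require_gcs(f"gs://{bucket}/raw/{partition}/"),
        "output": require_gcs(f"gs://{bucket}/processed/{partition}/"),
    }


def job_args(bucket: str, day: date) -> List[str]:
    """Arguments to place after the bare `--` in the submit command."""
    paths = daily_paths(bucket, day)
    return [f"--input={paths['input']}", f"--output={paths['output']}"]


if __name__ == "__main__":
    print(job_args("my-bucket", date(2025, 3, 15)))
```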

Creating and Deleting Clusters Programmatically

# Create a cluster with preemptible secondary workers
gcloud dataproc clusters create etl-cluster \
  --region=us-central1 \
  --master-machine-type=n2-standard-4 \
  --master-boot-disk-size=50GB \
  --num-workers=5 \
  --worker-machine-type=n2-standard-8 \
  --worker-boot-disk-size=100GB \
  --num-secondary-workers=10 \
  --secondary-worker-type=preemptible \
  --image-version=2.1-debian11 \
  --optional-components=JUPYTER,HIVE_WEBHCAT \
  --max-idle=1h \
  --max-age=24h

# Submit the job
gcloud dataproc jobs submit pyspark gs://my-bucket/jobs/etl.py \
  --cluster=etl-cluster \
  --region=us-central1

# Delete when done (or rely on --max-idle)
gcloud dataproc clusters delete etl-cluster --region=us-central1 --quiet

--max-idle=1h automatically deletes the cluster after 1 hour of inactivity. --max-age=24h deletes it after 24 hours regardless of activity. These safeguards prevent forgotten clusters from running indefinitely.
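
When the create, submit, and delete steps are driven from a script, it helps to guarantee that the delete runs even if the job fails. Here is a minimal sketch of that idea in Python. The function and the injectable `runner` parameter are hypothetical; the commands it builds are the same three `gcloud` invocations shown above, trimmed to the essential flags.

```python
import subprocess
from typing import Callable, List, Sequence

Runner = Callable[[Sequence[str]], None]


def run_gcloud(argv: Sequence[str]) -> None:
    """Default runner: execute the command, raising on a non-zero exit."""
    subprocess.run(list(argv), check=True)


def run_on_ephemeral_cluster(
    cluster: str, region: str, job_uri: str, runner: Runner = run_gcloud
) -> None:
    """Create a cluster, run one PySpark job, and always delete the cluster."""
    create: List[str] = [
        "gcloud", "dataproc", "clusters", "create", cluster,
        f"--region={region}", "--max-idle=1h", "--max-age=24h",
    ]
    submit: List[str] = [
        "gcloud", "dataproc", "jobs", "submit", "pyspark", job_uri,
        f"--cluster={cluster}", f"--region={region}",
    ]
    delete: List[str] = [
        "gcloud", "dataproc", "clusters", "delete", cluster,
        f"--region={region}", "--quiet",
    ]
    runner(create)
    try:
        runner(submit)
    finally:
        # Runs whether the job succeeded or raised.
        runner(delete)


if __name__ == "__main__":
    # Dry run: print the commands instead of executing them.
    run_on_ephemeral_cluster(
        "etl-cluster", "us-central1", "gs://my-bucket/jobs/etl.py",
        runner=lambda argv: print(" ".join(argv)),
    )
```

The `--max-idle` and `--max-age` flags stay in the create command as a second line of defence in case the script itself is killed before the `finally` block runs.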


Autoscaling Policies

For long-running clusters where job load fluctuates, autoscaling adds and removes workers automatically.

# Create an autoscaling policy
gcloud dataproc autoscaling-policies import my-autoscaling-policy \
  --region=us-central1 \
  --source=- << 'EOF'
workerConfig:
  minInstances: 2
  maxInstances: 20
  weight: 1
secondaryWorkerConfig:
  minInstances: 0
  maxInstances: 50
  weight: 1
basicAlgorithm:
  cooldownPeriod: 4m
  yarnConfig:
    scaleUpFactor: 1.0
    scaleDownFactor: 1.0
    scaleUpMinWorkerFraction: 0.0
    scaleDownMinWorkerFraction: 0.0
    gracefulDecommissionTimeout: 1h
EOF

# Attach to cluster
gcloud dataproc clusters create autoscaling-cluster \
  --region=us-central1 \
  --autoscaling-policy=my-autoscaling-policy

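The autoscaler checks YARN memory metrics once per cooldown period. When pending memory shows that jobs are queued waiting for resources, it adds workers; when available memory shows that workers are sitting idle, it removes them, waiting up to gracefulDecommissionTimeout for running tasks to finish.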
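
To build intuition for how the scale factors and instance bounds interact, here is a deliberately simplified model of one autoscaling decision. It is an illustration only and not Dataproc's actual algorithm: the function names, the per-worker memory parameter, and the rounding choices are all assumptions made for this sketch.

```python
import math


def recommend_worker_delta(
    pending_mb: float,
    available_mb: float,
    worker_mb: float,
    scale_up_factor: float = 1.0,
    scale_down_factor: float = 1.0,
) -> int:
    """Simplified model: how many workers to add (+) or remove (-).

    Pending YARN memory drives scale-up; unused YARN memory drives
    scale-down. Each factor is the fraction of that memory acted on.
    """
    if worker_mb <= 0:
        raise ValueError("worker_mb must be positive")
    if pending_mb > 0:
        return math.ceil(scale_up_factor * pending_mb / worker_mb)
    return -math.floor(scale_down_factor * available_mb / worker_mb)


def clamp_target(current: int, delta: int, min_instances: int, max_instances: int) -> int:
    """Apply the policy's minInstances/maxInstances bounds."""
    return max(min_instances, min(max_instances, current + delta))


if __name__ == "__main__":
    # 64 GB of queued requests, 16 GB of YARN memory per worker.
    delta = recommend_worker_delta(pending_mb=65536, available_mb=0, worker_mb=16384)
    print("delta:", delta)
    print("target:", clamp_target(current=5, delta=delta, min_instances=2, max_instances=20))
```

Lowering `scale_up_factor` below 1.0 in this model makes the cluster grow more cautiously, which is the same trade-off the policy's `scaleUpFactor` is meant to expose.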


Running Spark Jobs

# PySpark job stored in GCS: gs://my-bucket/jobs/sales_report.py
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("SalesReport").getOrCreate()

# Read from GCS (not HDFS)
orders = spark.read.parquet("gs://my-bucket/raw/orders/")
customers = spark.read.parquet("gs://my-bucket/raw/customers/")

# Transform
revenue_by_region = (
    orders.join(customers, orders.customer_id == customers.id)
    .groupBy("region")
    .agg(
        F.sum("amount").alias("total_revenue"),
        F.count("order_id").alias("order_count"),
        F.avg("amount").alias("avg_order_value"),
    )
    .orderBy(F.desc("total_revenue"))
)

# Write results to GCS
revenue_by_region.write.parquet(
    "gs://my-bucket/reports/revenue_by_region/",
    mode="overwrite",
)

spark.stop()

Submit this job:

gcloud dataproc jobs submit pyspark gs://my-bucket/jobs/sales_report.py \
  --cluster=etl-cluster \
  --region=us-central1 \
  --properties=spark.sql.shuffle.partitions=200
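
Before paying for a cluster run, it can be useful to check the expected output of the job on a handful of rows. The sketch below re-implements the same join, group, and aggregate logic in plain Python as a reference. It assumes customer ids are unique and that `amount` and `order_id` are never null, which the Spark job itself does not require; the sample records are made up.

```python
from collections import defaultdict
from typing import Dict, List


def revenue_by_region(orders: List[dict], customers: List[dict]) -> List[dict]:
    """Plain-Python reference for the Spark job's join + groupBy + agg."""
    region_of: Dict[str, str] = {c["id"]: c["region"] for c in customers}
    totals = defaultdict(lambda: {"total_revenue": 0.0, "order_count": 0})

    for order in orders:
        region = region_of.get(order["customer_id"])
        if region is None:
            continue  # inner join: drop orders with no matching customer
        totals[region]["total_revenue"] += order["amount"]
        totals[region]["order_count"] += 1

    rows = [
        {
            "region": region,
            "total_revenue": agg["total_revenue"],
            "order_count": agg["order_count"],
            "avg_order_value": agg["total_revenue"] / agg["order_count"],
        }
        for region, agg in totals.items()
    ]
    # Mirror orderBy(desc("total_revenue")).
    rows.sort(key=lambda row: row["total_revenue"], reverse=True)
    return rows


if __name__ == "__main__":
    sample_orders = [
        {"order_id": "o1", "customer_id": "c1", "amount": 100.0},
        {"order_id": "o2", "customer_id": "c2", "amount": 50.0},
        {"order_id": "o3", "customer_id": "c1", "amount": 20.0},
    ]
    sample_customers = [
        {"id": "c1", "region": "EMEA"},
        {"id": "c2", "region": "APAC"},
    ]
    for row in revenue_by_region(sample_orders, sample_customers):
        print(row)
```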

Hive and Presto for SQL Users

Not every team wants to write Spark code. Dataproc clusters ship with Hive, and Presto (Trino on newer image versions) can be added as an optional component, so SQL users can query data in GCS directly.

# Submit a Hive job
gcloud dataproc jobs submit hive \
  --cluster=etl-cluster \
  --region=us-central1 \
  --execute='
    CREATE EXTERNAL TABLE IF NOT EXISTS orders (
      order_id STRING,
      customer_id STRING,
      amount DOUBLE,
      order_date STRING
    )
    STORED AS PARQUET
    LOCATION "gs://my-bucket/raw/orders/";

    -- Group by the expression itself: positional GROUP BY is
    -- disabled by default in recent Hive versions.
    SELECT
      DATE_FORMAT(order_date, "yyyy-MM") AS month,
      SUM(amount) AS revenue,
      COUNT(*) AS order_count
    FROM orders
    WHERE order_date >= "2025-01-01"
    GROUP BY DATE_FORMAT(order_date, "yyyy-MM")
    ORDER BY month;
  '

Hive translates SQL to MapReduce or Tez jobs. Presto executes SQL directly in memory across workers for much faster interactive queries.


Dataproc Serverless: No Cluster Management

Dataproc Serverless lets you submit Spark batch or notebook jobs without creating or managing a cluster. You submit the job; Dataproc allocates workers, runs the job, and terminates resources automatically.

gcloud dataproc batches submit pyspark gs://my-bucket/jobs/etl.py \
  --region=us-central1 \
  --deps-bucket=gs://my-bucket/deps \
  --version=2.1 \
  --properties=spark.executor.cores=4,spark.executor.memory=8g
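
The `--properties` value is a comma-separated list of `key=value` pairs, which is easy to get wrong when the settings come from configuration. A small sketch of a helper that formats a dictionary into that string and assembles the batch command follows. The helper names are hypothetical, and the sketch simply rejects values containing commas rather than attempting gcloud's alternate-delimiter escaping.

```python
from typing import Dict, List


def format_properties(props: Dict[str, str]) -> str:
    """Render Spark properties as gcloud's key=value,key=value string."""
    for key, value in props.items():
        if "," in key or "=" in key or "," in str(value):
            raise ValueError(f"unsupported character in property {key!r}")
    return ",".join(f"{key}={value}" for key, value in props.items())


def batch_submit_argv(
    job_uri: str,
    region: str,
    deps_bucket: str,
    runtime_version: str,
    props: Dict[str, str],
) -> List[str]:
    """Build the argv for a Dataproc Serverless PySpark batch."""
    return [
        "gcloud", "dataproc", "batches", "submit", "pyspark", job_uri,
        f"--region={region}",
        f"--deps-bucket={deps_bucket}",
        f"--version={runtime_version}",
        f"--properties={format_properties(props)}",
    ]


if __name__ == "__main__":
    argv = batch_submit_argv(
        "gs://my-bucket/jobs/etl.py",
        region="us-central1",
        deps_bucket="gs://my-bucket/deps",
        runtime_version="2.1",
        props={"spark.executor.cores": "4", "spark.executor.memory": "8g"},
    )
    print(" ".join(argv))
```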

Dataproc Serverless is simpler operationally but has constraints: it runs Spark workloads only, without the wider Hadoop ecosystem tools, and there is no long-running cluster to keep around between jobs.

For Spark jobs where you want zero cluster management, Dataproc Serverless is the right choice. For jobs requiring Hadoop ecosystem tools or long-running clusters, managed Dataproc clusters remain the better option.


Initialization Actions: Customizing Cluster Setup

Run scripts on all cluster nodes at creation time to install libraries or configure software:

gcloud dataproc clusters create custom-cluster \
  --region=us-central1 \
  --initialization-actions=gs://my-bucket/init/install-deps.sh

# install-deps.sh content:
# #!/bin/bash
# pip install scikit-learn pandas boto3
# apt-get install -y libgomp1

Dataproc vs Dataflow: Side by Side

Dataflow (Apache Beam):
  + Serverless (no cluster provisioning)
  + Unified batch/stream model
  + Fully managed autoscaling
  - Must write in Beam SDK (different from Spark)
  - Harder to migrate existing Spark code

Dataproc (Spark/Hadoop):
  + Runs existing Spark, Hadoop, Hive, Pig code unchanged
  + Interactive notebooks (JupyterLab on cluster)
  + Richer ecosystem (Presto, HBase, Oozie, etc.)
  - Need to manage/monitor clusters
  - Streaming requires Spark Structured Streaming
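
The comparison above, together with the Serverless guidance earlier, can be condensed into a small decision helper. This is only a sketch of the article's rule of thumb: the function name, the framework labels, and the two boolean parameters are invented for illustration, and real decisions will weigh more factors than these.

```python
def choose_service(
    framework: str,
    needs_ecosystem_tools: bool = False,
    needs_long_running_cluster: bool = False,
) -> str:
    """Map a workload description to a service, per the rule of thumb above."""
    framework = framework.lower()
    if framework == "beam":
        # New pipelines written against the Beam model.
        return "Dataflow"
    if framework in ("hadoop", "hive", "pig"):
        # Hadoop ecosystem code needs a real cluster.
        return "Dataproc cluster"
    if framework == "spark":
        if needs_ecosystem_tools or needs_long_running_cluster:
            return "Dataproc cluster"
        return "Dataproc Serverless"
    raise ValueError(f"unknown framework: {framework!r}")


if __name__ == "__main__":
    print(choose_service("spark"))
    print(choose_service("spark", needs_long_running_cluster=True))
    print(choose_service("hive"))
    print(choose_service("beam"))
```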

Summary

Dataproc’s core advantage is running open-source big data frameworks — Spark, Hadoop, Hive, Presto — in a managed GCP service without configuration overhead. The ephemeral cluster pattern reduces costs dramatically compared to running permanent clusters. Storing data in GCS rather than HDFS makes clusters disposable. Autoscaling handles variable job loads. Dataproc Serverless removes cluster management entirely for Spark-only workloads. The decision between Dataproc and Dataflow comes down to existing code and workload type: Dataproc for Spark/Hadoop ecosystems, Dataflow for new streaming pipelines and ETL built on the Apache Beam model.