Cloud  /  Google Cloud

GCP Google Cloud Platform 25 guides · updated 2026

Guides to BigQuery, Vertex AI, GKE, Dataflow, and the rest of Google's data- and AI-first cloud — written for engineers shipping real workloads.

Google Cloud Dataflow: Apache Beam-Based Unified Stream and Batch Processing

Most data processing systems force a choice: either your code handles streaming data, or it handles batch data. Apache Beam, the programming model that Dataflow runs, rejects this distinction. You write one pipeline that describes what you want to happen to your data. Dataflow executes it, whether that data comes from a file in GCS or a continuously flowing Pub/Sub topic.

This guide covers the Apache Beam model, the key transforms, windowing for streaming, Flex Templates for deployment, and how Dataflow differs from Dataproc and Spark.


The Apache Beam Model

Beam has four core abstractions:

Pipeline
├── PCollection (distributed data set — could be bounded or unbounded)
├── PTransform (operation on a PCollection)
│ ├── Read source
│ ├── ParDo (per-element function — most common transform)
│ ├── GroupByKey (shuffle and group)
│ ├── Combine (aggregation within group)
│ └── Write sink
└── Runner (execution environment — DataflowRunner for GCP)

A pipeline is a directed acyclic graph of PTransforms applied to PCollections. When you call .run(), the Runner (Dataflow) optimizes and executes the graph.


A Basic Batch Pipeline

Read CSV from Cloud Storage, parse it, filter rows, and write to BigQuery:

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
options = PipelineOptions(
runner="DataflowRunner",
project="my-project",
region="us-central1",
temp_location="gs://my-bucket/temp",
staging_location="gs://my-bucket/staging",
)
BQ_SCHEMA = "order_id:STRING,customer_id:STRING,amount:FLOAT,order_date:DATE"
def parse_csv(line):
parts = line.split(",")
return {
"order_id": parts[0],
"customer_id": parts[1],
"amount": float(parts[2]),
"order_date": parts[3],
}
def is_valid_amount(record):
return record["amount"] > 0
with beam.Pipeline(options=options) as p:
(
p
| "Read orders" >> beam.io.ReadFromText("gs://my-bucket/orders/*.csv",
skip_header_lines=1)
| "Parse CSV" >> beam.Map(parse_csv)
| "Filter invalid" >> beam.Filter(is_valid_amount)
| "Write to BigQuery" >> beam.io.WriteToBigQuery(
"my-project:analytics.orders",
schema=BQ_SCHEMA,
write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
)
)

Switch to DirectRunner for local testing without provisioning Dataflow workers:

options = PipelineOptions(runner="DirectRunner")

Streaming Pipeline: Pub/Sub to BigQuery

The same code structure handles streaming by reading from Pub/Sub and enabling streaming mode:

import json
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions, StandardOptions
options = PipelineOptions(
runner="DataflowRunner",
project="my-project",
region="us-central1",
temp_location="gs://my-bucket/temp",
)
options.view_as(StandardOptions).streaming = True
def parse_message(message_bytes):
return json.loads(message_bytes.decode("utf-8"))
def add_processing_time(record):
import apache_beam as beam
record["processed_at"] = beam.utils.timestamp.Timestamp.now().to_rfc3339()
return record
with beam.Pipeline(options=options) as p:
(
p
| "Read from Pub/Sub" >> beam.io.ReadFromPubSub(
topic="projects/my-project/topics/orders"
)
| "Parse JSON" >> beam.Map(parse_message)
| "Add timestamp" >> beam.Map(add_processing_time)
| "Write to BigQuery" >> beam.io.WriteToBigQuery(
"my-project:analytics.orders_stream",
schema="order_id:STRING,amount:FLOAT,processed_at:TIMESTAMP",
)
)

Windowing: Handling Streaming Time

In batch processing, time does not matter — you process all the data in the file. In streaming, you must decide how to group data in time before aggregating. Windowing defines these time boundaries.

Fixed (tumbling) windows:
time ──────────────────────────────────────────────────►
[ 0s-60s ] [ 60s-120s ] [ 120s-180s ]
window_1 window_2 window_3
Sliding windows:
[ 0s-60s ]
[15s-75s]
[30s-90s]
(windows overlap, each event appears in multiple windows)
Session windows:
user opens app: ──── [activity] ──── 30s gap ──── [activity] ────
[ session 1 ] [ session 2 ]
(groups bursts of activity with idle-gap separation)
# Count orders per minute using fixed 60-second windows
from apache_beam import window
(
messages
| "Parse" >> beam.Map(parse_message)
| "Extract key-amount" >> beam.Map(lambda r: (r["region"], r["amount"]))
| "Window 60s" >> beam.WindowInto(window.FixedWindows(60))
| "Group by region" >> beam.CombinePerKey(sum)
| "Format" >> beam.Map(lambda kv: {
"region": kv[0],
"total_amount": kv[1],
})
| "Write" >> beam.io.WriteToBigQuery("my-project:analytics.regional_totals")
)

Handling Late Data with Watermarks and Triggers

In streaming, events arrive out of order. A sensor reading from 10:00 AM might arrive at 10:05 AM due to network delay. Watermarks define how late an event can be and still be included in a window.

from apache_beam.transforms.trigger import (
AfterWatermark,
AfterProcessingTime,
AccumulationMode,
)
(
messages
| "Window with late data" >> beam.WindowInto(
window.FixedWindows(60),
trigger=AfterWatermark(
early=AfterProcessingTime(10), # emit early every 10 seconds
late=AfterProcessingTime(5), # re-emit 5 seconds after each late arrival
),
accumulation_mode=AccumulationMode.ACCUMULATING,
allowed_lateness=60, # accept data up to 60 seconds late
)
| ...
)

ACCUMULATING mode includes all previous data when a window fires again due to late arrivals, giving you a complete picture but requiring downstream deduplication.


Flex Templates: Parameterized Deployments

Classic Dataflow templates have limitations — all code is compiled into the template at build time. Flex Templates containerize the pipeline in a Docker image, allowing dynamic parameter resolution at launch time.

Terminal window
# Build a Flex Template
gcloud dataflow flex-template build \
gs://my-bucket/templates/orders-pipeline.json \
--image-gcr-path=gcr.io/my-project/orders-pipeline:latest \
--sdk-language=PYTHON \
--flex-template-base-image=PYTHON310 \
--metadata-file=metadata.json \
--py-path=. \
--env=FLEX_TEMPLATE_PYTHON_PY_FILE=main.py
# Launch the Flex Template
gcloud dataflow flex-template run orders-job-$(date +%Y%m%d) \
--template-file-gcs-location=gs://my-bucket/templates/orders-pipeline.json \
--region=us-central1 \
--parameters=input_topic=projects/my-project/topics/orders \
--parameters=output_table=my-project:analytics.orders

Flex Templates are the current recommended deployment approach for production pipelines.


Dataflow vs Dataproc vs BigQuery: Choosing the Right Service

┌────────────────────────────────────────────────────────────────────────────┐
│ Use case │ Best service │
├────────────────────────────────────┼───────────────────────────────────────┤
│ Streaming data transformation │ Dataflow (real-time, serverless) │
│ Batch ETL, CSV/JSON → BigQuery │ Dataflow (batch mode) or BigQuery │
│ │ LOAD directly │
│ Existing Spark/Hadoop codebase │ Dataproc (runs native Spark/Hadoop) │
│ SQL analytics on structured data │ BigQuery │
│ ML feature pipelines │ Dataflow (Beam transforms) │
│ Hive, Pig workloads │ Dataproc │
└────────────────────────────────────┴───────────────────────────────────────┘

Dataflow and Dataproc both run distributed data processing but target different workloads. Dataflow is best when you want serverless execution and write pipelines in the Beam model. Dataproc is best when you have existing Spark, Hadoop, Hive, or Pig code and want to run it on managed clusters with minimal code changes.


Runner V2 (Dataflow Engine V2)

Runner V2 is the current generation Dataflow execution engine. It uses a Beam Fn API for worker communication, supports portable pipelines across languages, and is required for newer features like streaming engine and horizontal autoscaling.

Terminal window
# Enable Runner V2 explicitly
gcloud dataflow flex-template run my-job \
--template-file-gcs-location=gs://my-bucket/templates/pipeline.json \
--region=us-central1 \
--additional-experiments=use_runner_v2

For streaming jobs, enabling streaming_engine offloads state management from workers to a managed service, reducing memory pressure and improving autoscaling responsiveness.


Summary

Dataflow’s value proposition is the unified model: one Apache Beam pipeline that handles both batch and streaming data, deployed serverlessly without managing clusters. Windowing solves the streaming time problem. Flex Templates enable production-grade deployable pipelines with dynamic parameterization. Runner V2 provides the modern execution engine. The practical workflow is local development with DirectRunner, testing with a small Dataflow job, and production deployment via Flex Templates triggered by Cloud Scheduler or Pub/Sub. The alternative services — Dataproc for Spark workloads, BigQuery for SQL analytics — complement Dataflow rather than replacing it.