Google Cloud Bigtable: Wide-Column NoSQL for Time Series, IoT, and High-Throughput Reads
Bigtable is the database Google has run since 2004 — it stores data for Search indexing, Maps, Gmail, and YouTube. If your application needs to write hundreds of thousands of rows per second, maintain millisecond read latency at petabyte scale, and handle time-series or sparse data efficiently, Bigtable is the right tool. It is not a general-purpose database and should not be used as one.
The Data Model: Rows, Column Families, and Cells
Bigtable’s data model is deceptively simple but has important implications for performance:
Table: sensor_readings

                           │ column family: data
Row key                    │ data:temp │ data:humidity │ data:status
───────────────────────────┼───────────┼───────────────┼────────────
sensor_A#2025-03-15T10:00  │ 24.5      │ 62.1          │ OK
sensor_A#2025-03-15T10:01  │ 24.7      │ 62.3          │ OK
sensor_B#2025-03-15T10:00  │ 31.2      │ 48.7          │ WARN

Row key: The only index. Rows are sorted lexicographically by row key. All “reads in a range” are scans of contiguous row keys. The row key is the most important design decision.
Column families: Groups of columns defined in the table schema, usually at table creation time (families can also be added later). Column qualifiers within a family can be anything; you do not need to predefine which columns exist. A row can have data in some qualifiers but not others (sparse).
Cells: The intersection of a row and a column qualifier. Each cell can store multiple versions, each with a different timestamp. Reads return versions newest first, and how many versions are retained is controlled by the garbage collection policy you configure per column family.
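The layout described above can be modeled in a few lines of plain Python. This is a toy in-memory sketch, not the Bigtable client: the names table, write_cell, and scan are hypothetical, and a real tablet keeps keys sorted on disk rather than sorting on demand.

```python
from bisect import bisect_left

# Toy model: table[row_key][family][qualifier] -> [(timestamp, value), ...], newest first
table = {}

def write_cell(row_key, family, qualifier, value, timestamp):
    """Add one versioned cell; rows and qualifiers are created on demand (sparse)."""
    versions = table.setdefault(row_key, {}).setdefault(family, {}).setdefault(qualifier, [])
    versions.append((timestamp, value))
    versions.sort(reverse=True)  # keep the newest version at index 0

def scan(start_key, end_key):
    """Return rows with start_key <= key < end_key, in lexicographic key order."""
    keys = sorted(table)  # the row key is the only index
    lo = bisect_left(keys, start_key)
    hi = bisect_left(keys, end_key)
    return [(key, table[key]) for key in keys[lo:hi]]

write_cell(b"sensor_B#2025-03-15T10:00", "data", b"temp", b"31.2", 1)
write_cell(b"sensor_A#2025-03-15T10:01", "data", b"temp", b"24.7", 1)
write_cell(b"sensor_A#2025-03-15T10:00", "data", b"temp", b"24.5", 1)
write_cell(b"sensor_A#2025-03-15T10:00", "data", b"temp", b"24.6", 2)   # second version
write_cell(b"sensor_A#2025-03-15T10:00", "data", b"status", b"OK", 1)  # only this row has status

# Range read for one sensor: "$" is the byte right after "#"
sensor_a_rows = scan(b"sensor_A#", b"sensor_A$")
for key, families in sensor_a_rows:
    print(key, sorted(families["data"]))
```

The scan only ever touches a contiguous slice of the sorted key list, which is why every access pattern has to be expressible as a key or key range.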
Row Key Design: The Most Critical Decision
Because rows are sorted by key and there is no secondary index, every access pattern must be served by a scan of the row key space. Bad row key design leads to hotspots — one tablet server receives all writes while others sit idle.
Common patterns:
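Time series: reverse timestamp

Naive key:   sensor_A#2025-03-15T10:00    ← chronological order
  New writes for a sensor always land at the end of that sensor's range
  The newest data sorts last, so the latest readings cannot be read from the start of the range
  If the timestamp leads the key, or a few sensors dominate traffic, one tablet server gets all the writes (hotspot)

Better key:  sensor_A#<reversed_timestamp>
  Reversed timestamp = MAX_LONG - unix_timestamp_millis
  Newest data sorts to the front of each sensor's range
  Reads for recent data are fast (a short scan from the start of the sensor's prefix)
  Writes are spread across sensors by the sensor ID prefix, not by the reversed timestamp itself

Distributed write with hash prefix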
Without prefix:
  user_A_2025-03-01    ← sequential user IDs → sequential keys → hotspot
  user_B_2025-03-01
  user_C_2025-03-01

With salted prefix:
  3_user_A_2025-03-01  ← hash(user_A) % 4 = 3
  1_user_B_2025-03-01  ← hash(user_B) % 4 = 1
  2_user_C_2025-03-01  ← distributes writes across tablets

The trade-off with hashing is that a single range scan across the whole key space no longer works. You must scan each bucket separately and merge the results.
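The two key patterns in this section can be sketched as small key-building helpers. This is a minimal sketch under stated assumptions: the function names, the choice of MD5 as the hash, and the bucket count of 4 are illustrative and not part of any Bigtable API.

```python
import hashlib

MAX_LONG = 2**63 - 1  # largest signed 64-bit integer
NUM_BUCKETS = 4       # assumed salt bucket count, matching the example above

def reversed_ts_key(sensor_id: str, unix_millis: int) -> bytes:
    """Build '<sensor>#<MAX_LONG - millis>' so newer readings sort first."""
    reversed_ts = MAX_LONG - unix_millis
    # Zero-pad to 19 digits so lexicographic order matches numeric order.
    return f"{sensor_id}#{reversed_ts:019d}".encode()

def bucket_for(entity_id: str, num_buckets: int = NUM_BUCKETS) -> int:
    """Stable bucket from an MD5 digest (Python's built-in hash() varies per process)."""
    digest = hashlib.md5(entity_id.encode()).digest()
    return int.from_bytes(digest[:4], "big") % num_buckets

def salted_key(entity_id: str, date: str) -> bytes:
    """Build '<bucket>_<entity>_<date>'."""
    return f"{bucket_for(entity_id)}_{entity_id}_{date}".encode()

def bucket_prefixes(num_buckets: int = NUM_BUCKETS) -> list[bytes]:
    """One scan prefix per bucket; a full range read has to visit all of them."""
    return [f"{bucket}_".encode() for bucket in range(num_buckets)]

older = reversed_ts_key("sensor_A", 1_716_800_000_000)
newer = reversed_ts_key("sensor_A", 1_716_800_060_000)
print(sorted([older, newer])[0] == newer)  # the newer reading sorts first
print(salted_key("user_A", "2025-03-01"))
print(bucket_prefixes())
```

The salt must come from a deterministic hash of the entity ID so that a reader can recompute the bucket for a point lookup. Only scans that span many entities need to fan out over every prefix.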
Column Families and Garbage Collection
Column families are part of the table schema and are created through admin operations, not by writes. Each family can have its own garbage collection policy:

# Create a table with two column families
cbt -instance=my-instance createtable telemetry
cbt -instance=my-instance createfamily telemetry readings
cbt -instance=my-instance createfamily telemetry metadata

# Set garbage collection: keep only the 10 most recent versions in readings
cbt -instance=my-instance setgcpolicy telemetry readings maxversions=10

# Keep readings data for no more than 90 days
# Note: setgcpolicy replaces the family's previous policy. To enforce both
# rules at once, combine them in a single policy joined with "and" or "or".
cbt -instance=my-instance setgcpolicy telemetry readings maxage=90d

Garbage collection runs in the background. Data past the policy threshold is not deleted immediately; it is removed during compaction, and until then it can still appear in reads, so apply a read filter when stale versions matter.
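How a version-count rule and an age rule combine is easier to see with a small model. The sketch below is plain Python, not Bigtable's implementation or client API: eligible_for_gc is a hypothetical helper that marks which versions of one cell a policy would make eligible for removal when the rules are joined with "or" (union) or "and" (intersection).

```python
from datetime import datetime, timedelta, timezone

def eligible_for_gc(versions, now, max_versions=None, max_age=None, mode="or"):
    """Return the versions a policy would mark for removal.

    versions: list of (written_at, value) tuples in any order.
    mode "or":  a version is eligible if it breaks either rule.
    mode "and": a version is eligible only if it breaks both rules.
    """
    ordered = sorted(versions, key=lambda v: v[0], reverse=True)  # newest first
    eligible = []
    for rank, (written_at, value) in enumerate(ordered):
        checks = []
        if max_versions is not None:
            checks.append(rank >= max_versions)        # beyond the N newest
        if max_age is not None:
            checks.append(now - written_at > max_age)  # older than the age limit
        if not checks:
            continue  # no rules configured: nothing is eligible
        if any(checks) if mode == "or" else all(checks):
            eligible.append((written_at, value))
    return eligible

now = datetime(2025, 3, 15, tzinfo=timezone.utc)
versions = [(now - timedelta(days=d), f"v{d}") for d in (1, 30, 100, 200)]

union = eligible_for_gc(versions, now, max_versions=3, max_age=timedelta(days=90), mode="or")
both = eligible_for_gc(versions, now, max_versions=3, max_age=timedelta(days=90), mode="and")
print([value for _, value in union])  # versions breaking either rule
print([value for _, value in both])   # versions breaking both rules
```

With "or", the 100-day-old version is eligible even though it is among the three newest. With "and", it survives until it also falls outside the version limit.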
App Profiles: Routing and Replication
App profiles control how client traffic routes to Bigtable clusters. This matters for multi-cluster (replicated) Bigtable instances.
Bigtable Instance
├── Cluster 1: us-east1 (primary, write cluster)
└── Cluster 2: us-west1 (replica)

App profile: "backend-writes"
  Routing: Single cluster (us-east1)
  Reason: Strong consistency for writes

App profile: "analytics-reads"
  Routing: Multi-cluster (automatic load balancing)
  Reason: Reads can go to either cluster, lower latency
  Consistency: Eventual (reads may lag slightly behind writes)

This allows a single Bigtable instance to serve latency-sensitive transactional writes with strong consistency while simultaneously serving analytics reads with eventual consistency and load balancing.
Writing and Reading with Python
from google.cloud.bigtable import Client
from google.cloud.bigtable.row_filters import CellsColumnLimitFilter

client = Client(project="my-project")
instance = client.instance("my-instance")
table = instance.table("telemetry")

# Write a row. The key uses a reversed timestamp in milliseconds
# (MAX_LONG - unix_timestamp_millis) so the newest reading sorts first.
row_key = f"sensor_A#{2**63 - 1 - 1716800000000}".encode()
row = table.direct_row(row_key)
row.set_cell(
    column_family_id="readings",
    column="temperature",
    value=b"24.7",
)
row.set_cell(
    column_family_id="readings",
    column="humidity",
    value=b"62.1",
)
row.commit()

# Read a single row
row = table.read_row(row_key)
if row:
    temp = row.cells["readings"][b"temperature"][0].value
    print(f"Temperature: {temp.decode()}")

# Scan a range (the 100 most recent rows for sensor_A)
prefix_start = b"sensor_A#"
prefix_end = b"sensor_A$"  # $ sorts after # in ASCII
rows = table.read_rows(
    start_key=prefix_start,
    end_key=prefix_end,
    limit=100,
    filter_=CellsColumnLimitFilter(1),  # only the latest version of each column
)
for row in rows:
    print(row.row_key)

Bigtable vs HBase Compatibility
Bigtable provides an HBase-compatible client for Java, which means existing HBase applications can run against Cloud Bigtable with minimal code changes. The application keeps using the HBase API; only the connection configuration is replaced.
// HBase connection to Cloud Bigtable
Configuration config = BigtableConfiguration.configure(
    "my-project", "my-instance");
Connection connection = ConnectionFactory.createConnection(config);
Table table = connection.getTable(TableName.valueOf("telemetry"));

This compatibility is intentional: Bigtable allows organizations to migrate on-premises HBase workloads to a fully managed cloud service without rewriting application code.
Sizing and Performance
Bigtable performance scales linearly with the number of nodes:
Per node (approximate):
  10,000 rows/second (small rows, simple lookups)
  10 MB/s read throughput
  10 MB/s write throughput

Starting point guidelines:
  Development: 1 node, SSD
  Moderate production: 3 nodes minimum (minimum for HA)
  High throughput: size based on throughput target
    e.g., 100,000 writes/second → ~10 nodes minimum

Nodes can be added or removed without downtime. Bigtable automatically rebalances tablet distribution across the new node count. There is a delay (typically 20 minutes to several hours for large datasets) before full performance improvement is realized as data rebalances.
Storage is separate from nodes — adding nodes does not add storage. Bigtable charges per node-hour and per GB stored.
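The sizing arithmetic above can be wrapped in a small estimator. This is a rough planning sketch, not an official sizing formula: estimate_nodes is a hypothetical helper, and the per-node constants are the approximate figures quoted in this section, so replace them with measured numbers for your workload.

```python
import math

# Approximate per-node figures from this section; treat them as assumptions.
ROWS_PER_SEC_PER_NODE = 10_000
MB_PER_SEC_PER_NODE = 10

def estimate_nodes(rows_per_sec, avg_row_bytes, min_nodes=1):
    """Nodes needed to stay within both the row-rate and the byte-rate limit."""
    by_rows = rows_per_sec / ROWS_PER_SEC_PER_NODE
    by_bytes = (rows_per_sec * avg_row_bytes) / (MB_PER_SEC_PER_NODE * 1_000_000)
    return max(min_nodes, math.ceil(max(by_rows, by_bytes)))

small_rows = estimate_nodes(100_000, avg_row_bytes=100)           # row rate dominates
large_rows = estimate_nodes(100_000, avg_row_bytes=2_000)         # byte rate dominates
floor_only = estimate_nodes(2_000, avg_row_bytes=200, min_nodes=3)  # minimum applies
print(small_rows, large_rows, floor_only)
```

Taking the larger of the two ratios reflects that a cluster is limited by whichever resource runs out first. Large rows can make bandwidth, not row count, the binding constraint.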
Real-World Use Case: Fleet Telemetry
A logistics company tracks 50,000 vehicles. Each vehicle sends GPS location, speed, fuel level, and engine diagnostics every 10 seconds.
Write rate:
  50,000 vehicles × 6 data points each, once every 10 seconds
  = 50,000 × 6 ÷ 10 = 30,000 data points / second
  = needs roughly 3-4 Bigtable nodes (SSD)

Row key design:
  {vehicle_id}#{reversed_timestamp_ms}
  Reverse timestamp ensures latest data sorts first
  Prefix scan on {vehicle_id}# retrieves recent data for that vehicle

Column family: "telemetry"
  Columns: lat, lon, speed, fuel, engine_code

Garbage collection: keep 30 days of data (raw)
  Aggregated summaries exported nightly to BigQuery for historical analysis

Querying the last 5 minutes for vehicle VH-4821:
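import time

now_ms = int(time.time() * 1000)
# With reversed timestamps, "now" has the smallest key, so it starts the range.
start_key = f"VH-4821#{2**63 - 1 - now_ms}".encode()
end_key = f"VH-4821#{2**63 - 1 - (now_ms - 300_000)}".encode()
rows = table.read_rows(start_key=start_key, end_key=end_key)

Summary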
Bigtable excels at what it was designed for: high-throughput, low-latency, large-scale storage for time-series, IoT telemetry, financial tick data, and user activity logs. The row key is not just a lookup identifier; it is the entire access pattern. Design it wrong and you get hotspots. Design it right and Bigtable scales linearly to millions of operations per second. Column families group related columns while leaving qualifiers free-form, so the schema stays flexible. HBase compatibility eases migration of existing workloads. The key discipline is matching Bigtable’s strengths (fast single-key lookups and range scans) to your access patterns, and using BigQuery or a separate analytics layer for complex multi-dimensional queries.