Google Cloud BigQuery Storage Client Library
The Google Cloud BigQuery Storage API client library enables high-throughput data transfer to and from BigQuery tables. It uses binary serialization formats (Apache Avro or Apache Arrow for reads, Protobuf for writes) for efficient transfer and is ideal for analytical workloads requiring large-scale data extraction. The library is currently at version 2.36.2 and is released as part of the `google-cloud-python` monorepo on a frequent release cadence.
Warnings
- breaking Major breaking changes occurred in version 2.0.0. The primary import path for clients changed from `google.cloud.bigquery_storage_v1` to `google.cloud.bigquery_storage`. Enum types moved from direct import or client access (e.g., `BigQueryReadClient.enums`) to the `types` module (e.g., `types.DataFormat.ARROW`). Existing code using the `_v1` suffix or direct enum access will fail.
- deprecated The `client_config` and `channel` parameters for client constructors have been removed; client configuration is now done through the `client_options` and `transport` constructor parameters.
- gotcha The BigQuery Storage API is optimized for high-throughput, large-scale data transfer, not for small, interactive queries or single-row lookups. Using it for small datasets may introduce unnecessary overhead compared to the standard BigQuery client library.
- gotcha Data read from the Storage API is returned in a binary format (Apache Avro or Apache Arrow). To work with this data easily in Python, you typically need to convert it, which requires additional dependencies such as `pandas` and `pyarrow` (for `to_dataframe()`) or `fastavro` (for `rows()` to yield dicts from AVRO).
- gotcha While `BigQueryReadClient.create_read_session` allows specifying `max_stream_count` for parallelism, achieving true concurrent data processing in Python often requires using the `multiprocessing` module rather than simple threading due to Python's Global Interpreter Lock (GIL).
- gotcha The `google-cloud-bigquery` client library, which often complements `google-cloud-bigquery-storage`, has ended support for Python 3.7 and 3.8. Although `google-cloud-bigquery-storage` officially supports Python >=3.7, it is highly recommended to upgrade to Python 3.9+ to maintain compatibility with the broader Google Cloud client ecosystem and ensure ongoing support.
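The multiprocessing gotcha above can be sketched as follows: request several streams in one read session and give each worker process its own client. This is a sketch, not part of the library's API; `parallel_row_count` and `count_rows_in_stream` are illustrative names, and decoding the rows still requires `pyarrow` (for ARROW) or `fastavro` (for AVRO) in each worker.

```python
from multiprocessing import Pool


def count_rows_in_stream(stream_name):
    """Worker: count the rows in a single read stream.

    Each process builds its own BigQueryReadClient, since gRPC
    channels should not be shared across processes.
    """
    # Imported inside the function so spawned workers resolve it themselves.
    from google.cloud.bigquery_storage import BigQueryReadClient

    client = BigQueryReadClient()
    reader = client.read_rows(stream_name)
    return sum(1 for _ in reader.rows())


def parallel_row_count(parent, table, max_workers=4):
    """Split a read session into streams and count rows in parallel."""
    from google.cloud.bigquery_storage import BigQueryReadClient, types

    client = BigQueryReadClient()
    session = client.create_read_session(
        parent=parent,
        read_session=types.ReadSession(
            table=table, data_format=types.DataFormat.ARROW
        ),
        max_stream_count=max_workers,  # the server may return fewer streams
    )
    stream_names = [stream.name for stream in session.streams]
    with Pool(processes=len(stream_names)) as pool:
        counts = pool.map(count_rows_in_stream, stream_names)
    return sum(counts)
```

Note that `max_stream_count` is only an upper bound: the server decides how many streams the session actually gets, so size the pool from `session.streams`, not from the requested count.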
Install
- pip install google-cloud-bigquery-storage
- pip install 'google-cloud-bigquery-storage[fastavro,pandas,pyarrow]'
Imports
- BigQueryReadClient
from google.cloud.bigquery_storage import BigQueryReadClient
- BigQueryWriteClient
from google.cloud.bigquery_storage import BigQueryWriteClient
- types
from google.cloud.bigquery_storage import types
Quickstart
```python
import os

from google.cloud.bigquery_storage import BigQueryReadClient, types

# Your Google Cloud project ID, read from the environment
# (e.g., set by `gcloud` or exported manually).
project_id = os.environ.get("GOOGLE_CLOUD_PROJECT", "")
if not project_id:
    raise ValueError("GOOGLE_CLOUD_PROJECT environment variable must be set")

# Public BigQuery dataset and table to read from
parent = f"projects/{project_id}"
dataset_id = "google_trends"
table_id = "international_top_rising_terms"
table = f"projects/bigquery-public-data/datasets/{dataset_id}/tables/{table_id}"


def read_bigquery_table_storage(parent, table):
    """Reads data from a BigQuery table using the BigQuery Storage Read API."""
    client = BigQueryReadClient()

    # Select only the needed columns and the desired data format
    # (Arrow is recommended for performance).
    read_options = types.ReadSession.TableReadOptions(
        selected_fields=["country_name", "region_name"]
    )
    requested_session = types.ReadSession(
        table=table,
        data_format=types.DataFormat.ARROW,  # Or types.DataFormat.AVRO
        read_options=read_options,
    )

    # Create a read session
    read_session = client.create_read_session(
        parent=parent,
        read_session=requested_session,
        max_stream_count=1,  # Increase for parallelism if needed
    )
    print(f"Read session created: {read_session.name}")

    # Read from the first stream (safe because max_stream_count=1)
    stream_name = read_session.streams[0].name
    reader = client.read_rows(stream_name)

    try:
        # Convert to a pandas DataFrame (requires pandas and pyarrow)
        dataframe = reader.to_dataframe()
        print("Successfully read data into a pandas DataFrame.")
        print(dataframe.head())
        return dataframe
    except ImportError as exc:
        print(f"Could not convert to DataFrame: {exc}. Iterating rows instead.")
        # rows() yields dict-like row objects; decoding requires
        # fastavro for AVRO or pyarrow for ARROW.
        for row in reader.rows():
            print(row)
            break  # Print the first row and stop


if __name__ == "__main__":
    # Authenticate first (e.g., `gcloud auth application-default login`)
    # and make sure the BigQuery Storage API is enabled for your project.
    # The example reads a public dataset, so you mainly need read access
    # through your billing project.
    df = read_bigquery_table_storage(parent, table)
```
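If you want columnar data without going through pandas, the reader can also produce a `pyarrow.Table` via `to_arrow()`. A minimal sketch, assuming `pyarrow` is installed and reusing `parent` and `table` values like those in the quickstart; `read_table_as_arrow` is an illustrative name, not part of the library:

```python
def read_table_as_arrow(parent, table):
    """Sketch: read a table into a pyarrow.Table via the Storage Read API."""
    from google.cloud.bigquery_storage import BigQueryReadClient, types

    client = BigQueryReadClient()
    session = client.create_read_session(
        parent=parent,
        read_session=types.ReadSession(
            table=table,
            data_format=types.DataFormat.ARROW,
        ),
        max_stream_count=1,
    )
    reader = client.read_rows(session.streams[0].name)
    arrow_table = reader.to_arrow()  # requires pyarrow
    print(arrow_table.schema)
    return arrow_table
```

Arrow tables keep the data columnar end to end, which avoids the row-by-row decode cost of `rows()` and converts to a DataFrame later with `arrow_table.to_pandas()` if needed.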