Google Cloud BigQuery Storage Client Library

2.36.2 · active · verified Sun Mar 29

The Google Cloud BigQuery Storage API client library enables high-throughput data transfer out of BigQuery tables. The Read API streams table data in a binary serialization format (Apache Arrow or Avro) for efficient transfer, making it well suited to analytical workloads that extract data at scale. The library is currently at version 2.36.2 and is released as part of the `google-cloud-python` monorepo on a frequent release cadence.

Install
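
The quickstart below assumes the library is installed from PyPI. A typical install, plus the optional pandas and pyarrow packages that the `to_dataframe()` conversion relies on:

```shell
pip install google-cloud-bigquery-storage

# Optional: required only for the to_dataframe() conversion in the quickstart
pip install pandas pyarrow
```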

Quickstart

This quickstart demonstrates how to use `BigQueryReadClient` to read data from a public BigQuery table using the Storage API. It configures a read session, reads data in Apache Arrow format, and attempts to convert it to a Pandas DataFrame. Remember to set the `GOOGLE_CLOUD_PROJECT` environment variable and ensure the BigQuery Storage API is enabled for your project.

import os
from google.cloud.bigquery_storage import BigQueryReadClient, types

# Your Google Cloud project ID, read from the GOOGLE_CLOUD_PROJECT
# environment variable (set it directly or via your gcloud configuration).
project_id = os.environ.get("GOOGLE_CLOUD_PROJECT", "")
if not project_id:
    raise ValueError("GOOGLE_CLOUD_PROJECT environment variable must be set")

# Public BigQuery dataset and table to read from
parent = f"projects/{project_id}"
dataset_id = "google_trends"
table_id = "international_top_rising_terms"
table = f"projects/bigquery-public-data/datasets/{dataset_id}/tables/{table_id}"

def read_bigquery_table_storage(project_id, table):
    """Reads data from a BigQuery table using the BigQuery Storage Read API."""
    client = BigQueryReadClient()

    # Specify the table and desired data format (Arrow recommended for performance)
    read_options = types.ReadSession.TableReadOptions(selected_fields=["country_name", "region_name"])
    requested_session = types.ReadSession(
        table=table,
        data_format=types.DataFormat.ARROW, # Or types.DataFormat.AVRO
        read_options=read_options,
    )

    # Create a read session; `parent` identifies the project billed for the read
    read_session = client.create_read_session(
        parent=f"projects/{project_id}",
        read_session=requested_session,
        max_stream_count=1,  # Increase for parallel reads across streams
    )

    print(f"Read session created: {read_session.name}")

    # Read from the first stream (assuming max_stream_count=1)
    stream_name = read_session.streams[0].name
    reader = client.read_rows(stream_name)

    # Convert to a Pandas DataFrame (requires pandas and pyarrow installed)
    try:
        dataframe = reader.to_dataframe()
        print("Successfully read data into a Pandas DataFrame.")
        print(dataframe.head())
        return dataframe
    except ImportError as e:
        # to_dataframe() raises ImportError when pandas/pyarrow are missing,
        # before consuming the stream, so we can fall back to iterating rows.
        print(f"Could not convert to DataFrame: {e}. Reading rows directly:")
        for row in reader.rows():
            # Each row behaves like a dict mapping column names to values,
            # whether the session requested ARROW or AVRO.
            print(row)
            break  # Print only the first row

if __name__ == "__main__":
    # Authenticate first (e.g., `gcloud auth application-default login`)
    # and enable the BigQuery Storage API for your project.
    # The example reads a public dataset, so you only need read access
    # billed to the project named by GOOGLE_CLOUD_PROJECT.
    df = read_bigquery_table_storage(project_id, table)
