Google Cloud BigQuery Storage Client Library

version 2.36.2 · verified Tue May 12 · auth: no · python install: verified · quickstart: stale

The Google Cloud BigQuery Storage API client library enables high-throughput data transfer from BigQuery tables. It streams rows in a binary serialization format (Apache Arrow or Avro), which makes it well suited to analytical workloads that extract data at scale. The library is currently at version 2.36.2 and is released as part of the `google-cloud-python` monorepo on a frequent release cadence.

pip install google-cloud-bigquery-storage
error ModuleNotFoundError: No module named 'google.cloud.bigquery_storage'
cause The `google-cloud-bigquery-storage` library is not installed in the Python environment, or the Python interpreter cannot find it in its path.
fix
Install the library using pip: pip install google-cloud-bigquery-storage
error ImportError: cannot import name 'bigquery_storage_v1beta1' from 'google.cloud'
cause This error typically occurs when trying to import an older, deprecated version of the BigQuery Storage API client (`v1beta1`) after upgrading the `google-cloud-bigquery-storage` library to a newer version (2.x or later), which primarily uses the `v1` or top-level `bigquery_storage` namespace.
fix
Update your import statements to use the current API version or the top-level namespace: from google.cloud.bigquery_storage import BigQueryReadClient, types or from google.cloud import bigquery_storage
error ValueError: The pyarrow library is not installed, please install pyarrow to use the to_arrow() function.
cause The `google-cloud-bigquery-storage` library relies on `pyarrow` for efficient data handling when working with the Apache Arrow format or converting results to Pandas DataFrames, but `pyarrow` is an optional dependency and is not installed in the environment.
fix
Install the pyarrow library: pip install pyarrow or install with the BigQuery client extra: pip install google-cloud-bigquery[bqstorage,pandas]
error AttributeError: 'NoneType' object has no attribute '_parse_avro_schema'
cause This error can occur when using `reader.to_dataframe()` on a `ReadRowsIterable` object that returns no rows, or when there's an issue parsing the Avro schema, sometimes related to specific data types or filters on the table.
fix
Ensure that the read session is actually expected to return data. If it may legitimately be empty, check for rows before converting to a DataFrame or wrap the conversion in a try/except block, as sketched below. Also verify that `fastavro` is installed when the session uses AVRO serialization: pip install fastavro.
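A minimal defensive sketch, assuming `client` and `stream_name` are set up as in the quickstart below and that the stream may legitimately be empty:

import pandas as pd

reader = client.read_rows(stream_name)
try:
    dataframe = reader.to_dataframe()
except (AttributeError, ValueError) as exc:
    # Schema parsing can fail on empty or unusual streams; fall back to an empty frame.
    print(f"Falling back to an empty DataFrame: {exc}")
    dataframe = pd.DataFrame()
print(f"Read {len(dataframe)} rows")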
breaking Major breaking changes occurred in version 2.0.0. The primary import path for clients changed from `google.cloud.bigquery_storage_v1` to `google.cloud.bigquery_storage`. Enum types moved from direct import or client access (e.g., `BigQueryReadClient.enums`) to the `types` module (e.g., `types.DataFormat.ARROW`). Existing code using the `_v1` suffix or direct enum access will fail.
fix Update import statements to `from google.cloud.bigquery_storage import ...` and access enums via `from google.cloud.bigquery_storage import types`.
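For example, pre-2.0 code maps roughly onto the 2.x layout as follows (the 1.x lines are shown as comments; their exact names varied by release):

# 1.x style (no longer works in 2.x):
#   from google.cloud import bigquery_storage_v1
#   client = bigquery_storage_v1.BigQueryReadClient()
#   data_format = client.enums.DataFormat.ARROW
# 2.x style:
from google.cloud.bigquery_storage import BigQueryReadClient, types

client = BigQueryReadClient()
data_format = types.DataFormat.ARROW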
deprecated The `client_config` and `channel` parameters for client constructors have been removed.
fix Remove these parameters from client instantiation. Customize retry and timeout settings directly when invoking methods (e.g., `client.create_read_session(..., timeout=60)`).
gotcha The BigQuery Storage API is optimized for high-throughput, large-scale data transfer, not for small, interactive queries or single-row lookups. Using it for small datasets may introduce unnecessary overhead compared to the standard BigQuery client library.
fix Use the `google-cloud-bigquery` client library for standard SQL queries, small data extractions, or metadata operations. Reserve `google-cloud-bigquery-storage` for high-volume data ingestion or extraction workflows.
gotcha Data read from the Storage API is returned in a binary format (Apache Arrow or Avro). To easily work with this data in Python, you typically need to convert it. This often requires additional dependencies such as `pandas` and `pyarrow` (for `to_dataframe()`) or `fastavro` (for `rows()` on AVRO sessions).
fix Install `google-cloud-bigquery-storage` with appropriate extras (e.g., `pip install google-cloud-bigquery-storage[fastavro,pandas,pyarrow]`) and use methods like `reader.to_dataframe()` or `reader.rows()`.
gotcha While `BigQueryReadClient.create_read_session` allows specifying `max_stream_count` for parallelism, achieving true concurrent data processing in Python often requires using the `multiprocessing` module rather than simple threading due to Python's Global Interpreter Lock (GIL).
fix For optimal parallel performance when reading multiple streams, consider using Python's `multiprocessing` module or an asynchronous framework if your I/O operations are truly non-blocking across network requests.
gotcha The `google-cloud-bigquery` client library, which often complements `google-cloud-bigquery-storage`, has ended support for Python 3.7 and 3.8. Although `google-cloud-bigquery-storage` officially supports Python >=3.7, it is highly recommended to upgrade to Python 3.9+ to maintain compatibility with the broader Google Cloud client ecosystem and ensure ongoing support.
fix Upgrade your Python environment to 3.9 or higher.
gotcha Most Google Cloud client libraries, including `google-cloud-bigquery-storage`, require credentials before they can make requests. In development this usually means Application Default Credentials (via `gcloud auth application-default login`) or a service-account key referenced by `GOOGLE_APPLICATION_CREDENTIALS`. Separately, sample code (including the quickstart below) often reads the project ID from the `GOOGLE_CLOUD_PROJECT` environment variable, and leaving it unset produces errors like 'GOOGLE_CLOUD_PROJECT environment variable must be set'.
fix
Ensure the `GOOGLE_CLOUD_PROJECT` environment variable is set to your Google Cloud project ID (e.g., `export GOOGLE_CLOUD_PROJECT='your-project-id'`) before running your application, and make sure valid credentials are available, either through Application Default Credentials or by pointing `GOOGLE_APPLICATION_CREDENTIALS` at a valid service account key file. Explicit credentials can also be passed to the client constructor via the `credentials` argument.
gotcha The generated Storage API clients do not take a project ID in their constructors; the project travels in the `parent` resource path passed to calls such as `create_read_session` (e.g., `projects/your-project-id`). Code that builds this path from the `GOOGLE_CLOUD_PROJECT` environment variable, as the quickstart below does, will raise a `ValueError` if the variable is unset.
fix
Set `GOOGLE_CLOUD_PROJECT` to your Google Cloud project ID, or build the `parent` string from an explicitly supplied project ID (e.g., `parent=f"projects/{project_id}"`), as sketched below.
pip install google-cloud-bigquery-storage[fastavro,pandas,pyarrow]
python os / libc variant status wheel install import disk
3.10 alpine (musl) google-cloud-bigquery-storage wheel - 1.65s 70.2M
3.10 alpine (musl) fastavro,pandas,pyarrow wheel - 2.76s 400.0M
3.10 alpine (musl) google-cloud-bigquery-storage - - 1.54s 69.0M
3.10 alpine (musl) fastavro,pandas,pyarrow - - 2.65s 393.8M
3.10 slim (glibc) google-cloud-bigquery-storage wheel 6.5s 1.04s 68M
3.10 slim (glibc) fastavro,pandas,pyarrow wheel 14.8s 1.98s 369M
3.10 slim (glibc) google-cloud-bigquery-storage - - 1.02s 67M
3.10 slim (glibc) fastavro,pandas,pyarrow - - 1.83s 364M
3.11 alpine (musl) google-cloud-bigquery-storage wheel - 2.23s 74.9M
3.11 alpine (musl) fastavro,pandas,pyarrow wheel - 3.83s 418.3M
3.11 alpine (musl) google-cloud-bigquery-storage - - 2.54s 73.7M
3.11 alpine (musl) fastavro,pandas,pyarrow - - 4.32s 412.0M
3.11 slim (glibc) google-cloud-bigquery-storage wheel 5.1s 1.57s 73M
3.11 slim (glibc) fastavro,pandas,pyarrow wheel 13.9s 2.74s 387M
3.11 slim (glibc) google-cloud-bigquery-storage - - 1.55s 71M
3.11 slim (glibc) fastavro,pandas,pyarrow - - 2.84s 382M
3.12 alpine (musl) google-cloud-bigquery-storage wheel - 2.38s 66.3M
3.12 alpine (musl) fastavro,pandas,pyarrow wheel - 3.51s 403.0M
3.12 alpine (musl) google-cloud-bigquery-storage - - 2.53s 65.2M
3.12 alpine (musl) fastavro,pandas,pyarrow - - 4.74s 396.7M
3.12 slim (glibc) google-cloud-bigquery-storage wheel 4.6s 1.87s 64M
3.12 slim (glibc) fastavro,pandas,pyarrow wheel 12.6s 3.19s 371M
3.12 slim (glibc) google-cloud-bigquery-storage - - 2.28s 63M
3.12 slim (glibc) fastavro,pandas,pyarrow - - 4.18s 366M
3.13 alpine (musl) google-cloud-bigquery-storage wheel - 2.35s 66.0M
3.13 alpine (musl) fastavro,pandas,pyarrow wheel - 3.28s 401.8M
3.13 alpine (musl) google-cloud-bigquery-storage - - 2.59s 64.8M
3.13 alpine (musl) fastavro,pandas,pyarrow - - 4.56s 395.4M
3.13 slim (glibc) google-cloud-bigquery-storage wheel 5.0s 1.78s 64M
3.13 slim (glibc) fastavro,pandas,pyarrow wheel 12.9s 2.98s 370M
3.13 slim (glibc) google-cloud-bigquery-storage - - 2.15s 62M
3.13 slim (glibc) fastavro,pandas,pyarrow - - 4.22s 365M
3.9 alpine (musl) google-cloud-bigquery-storage wheel - 1.58s 70.2M
3.9 alpine (musl) fastavro,pandas,pyarrow wheel - 2.77s 389.5M
3.9 alpine (musl) google-cloud-bigquery-storage - - 1.42s 69.1M
3.9 alpine (musl) fastavro,pandas,pyarrow - - 2.44s 388.6M
3.9 slim (glibc) google-cloud-bigquery-storage wheel 7.0s 1.30s 68M
3.9 slim (glibc) fastavro,pandas,pyarrow wheel 17.2s 2.32s 367M
3.9 slim (glibc) google-cloud-bigquery-storage - - 1.11s 67M
3.9 slim (glibc) fastavro,pandas,pyarrow - - 2.03s 366M

This quickstart demonstrates how to use `BigQueryReadClient` to read data from a public BigQuery table using the Storage API. It configures a read session, reads data in Apache Arrow format, and attempts to convert it to a Pandas DataFrame. Remember to set the `GOOGLE_CLOUD_PROJECT` environment variable and ensure the BigQuery Storage API is enabled for your project.

import os
from google.cloud.bigquery_storage import BigQueryReadClient, types

# Your Google Cloud project ID, read from the GOOGLE_CLOUD_PROJECT environment
# variable. The script fails fast below if the variable is not set.
project_id = os.environ.get('GOOGLE_CLOUD_PROJECT', '')
if not project_id:
    raise ValueError("GOOGLE_CLOUD_PROJECT environment variable must be set")

# Public BigQuery dataset and table to read from
dataset_id = "google_trends"
table_id = "international_top_rising_terms"
table = f"projects/bigquery-public-data/datasets/{dataset_id}/tables/{table_id}"

def read_bigquery_table_storage(project_id, table):
    """Reads data from a BigQuery table using the BigQuery Storage Read API."""
    client = BigQueryReadClient()

    # Specify the table and desired data format (Arrow recommended for performance)
    read_options = types.ReadSession.TableReadOptions(selected_fields=["country_name", "region_name"])
    requested_session = types.ReadSession(
        table=table,
        data_format=types.DataFormat.ARROW, # Or types.DataFormat.AVRO
        read_options=read_options,
    )

    # Create a read session in the billing project
    parent = f"projects/{project_id}"
    read_session = client.create_read_session(
        parent=parent,
        read_session=requested_session,
        max_stream_count=1,  # Adjust for parallelism if needed
    )

    print(f"Read session created: {read_session.name}")

    # Read from the first stream (assuming max_stream_count=1)
    stream_name = read_session.streams[0].name
    reader = client.read_rows(stream_name)

    # Convert to a Pandas DataFrame (requires pandas and pyarrow to be installed)
    try:
        dataframe = reader.to_dataframe()
        print("Successfully read data into Pandas DataFrame.")
        print(dataframe.head())
        return dataframe
    except (ImportError, ValueError) as e:
        # to_dataframe() raises if pandas or pyarrow is missing.
        print(f"Could not convert to DataFrame: {e}. Iterating rows directly:")
        for row in reader.rows():
            # Each row is a dict-like mapping of column name to value; ARROW
            # sessions need pyarrow and AVRO sessions need fastavro for this too.
            print(row)
            break  # Print only the first row

if __name__ == "__main__":
    # Ensure you have authenticated to GCP (e.g., `gcloud auth application-default login`)
    # and enabled the BigQuery Storage API for your project.
    # Set the GOOGLE_CLOUD_PROJECT environment variable to your project ID before running.
    # Example uses a public dataset, so you mainly need read access to your billing project.
    df = read_bigquery_table_storage(project_id, table)