Google Cloud Dataproc Metastore API client library
Google Cloud Dataproc Metastore is a fully managed, highly available, autohealing, serverless Apache Hive metastore (HMS) that runs on Google Cloud. It simplifies technical metadata management for data lakes and provides interoperability between data processing engines such as Apache Hive, Apache Spark, and Presto. The `google-cloud-dataproc-metastore` Python client library lets developers interact with this service programmatically. The library is developed in the `google-cloud-python` monorepo, whose client libraries are released frequently, often weekly.
Warnings
- gotcha Dataproc Metastore offers two service versions: Dataproc Metastore 1 and Dataproc Metastore 2. Version 2 provides horizontal scalability and has a different pricing model. When creating or configuring services, ensure you are aware of which version you intend to use as it impacts features and cost.
- breaking Mismatched Dataproc and Hive Metastore versions can cause failures. Specifically, Dataproc 3.x versions are incompatible with Dataproc Metastore. Pairing Dataproc 1.5 with a Dataproc Metastore service running Hive 3.1.2 may also cause backward-compatibility problems.
- gotcha Dataproc Metastore services can expose either Apache Thrift or gRPC endpoints. While Thrift is widely used, gRPC is often recommended for integration with newer Google Cloud services like Dataplex. The chosen endpoint protocol must match how clients connect to the service.
- gotcha Proper authentication is critical for connecting to Google Cloud services. A common footgun is forgetting to set up Application Default Credentials, or failing to grant the appropriate IAM roles to the service account or user.
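The authentication warning can be checked before any Metastore call is made. The following is a minimal sketch, not an official recipe: `check_adc` is a hypothetical helper name, and it assumes the `google-auth` package (installed as a dependency of the client library) is available so that `google.auth.default()` can be called. The loader is injectable so the check can be exercised without real credentials.

```python
from typing import Callable, Optional, Tuple


def check_adc(
    load_default: Optional[Callable[[], Tuple[object, Optional[str]]]] = None,
) -> Tuple[bool, str]:
    """Report whether Application Default Credentials can be resolved.

    `check_adc` is a hypothetical helper, not part of the client library.
    `load_default` defaults to `google.auth.default`; pass a stand-in
    callable to test the logic without real credentials.
    """
    if load_default is None:
        # Imported lazily so the helper can be loaded without google-auth.
        import google.auth

        load_default = google.auth.default

    try:
        _credentials, project = load_default()
    except Exception as exc:  # google-auth raises DefaultCredentialsError here
        return False, f"No usable credentials: {exc}"
    return True, f"Credentials found (default project: {project or 'not set'})"
```

A successful check only shows that credentials resolve. It does not confirm that the account holds the IAM roles needed for Dataproc Metastore, so a permission error can still surface on the first API call.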
Install
-
pip install google-cloud-dataproc-metastore
Imports
- DataprocMetastoreClient
from google.cloud.metastore_v1.services.dataproc_metastore import DataprocMetastoreClient
- Service
from google.cloud.metastore_v1.types import Service
- ListServicesRequest
from google.cloud.metastore_v1.types import ListServicesRequest
Quickstart
import os

from google.cloud.metastore_v1.services.dataproc_metastore import DataprocMetastoreClient
from google.cloud.metastore_v1.types import ListServicesRequest


def list_metastore_services(project_id: str, location: str) -> None:
    """Lists Dataproc Metastore services in a given project and location.

    Args:
        project_id: Your Google Cloud project ID.
        location: The Google Cloud location (e.g., 'us-central1').
    """
    # Instantiates a client
    client = DataprocMetastoreClient()

    # The resource name of the location where the services are located.
    # Example: "projects/my-project/locations/us-central1"
    parent = f"projects/{project_id}/locations/{location}"

    # Construct the request
    request = ListServicesRequest(parent=parent)

    # Call the API
    try:
        page_result = client.list_services(request=request)
        print(f"Dataproc Metastore services in {parent}:")
        found_services = False
        for service in page_result:
            print(f"- {service.name} (State: {service.state.name})")
            found_services = True
        if not found_services:
            print("  No Dataproc Metastore services found.")
    except Exception as e:
        print(f"Error listing services: {e}")
        print("Ensure the API is enabled, credentials are set, and the location is valid.")


# To run this quickstart:
# 1. Ensure `gcloud auth application-default login` has been run or `GOOGLE_APPLICATION_CREDENTIALS` is set.
# 2. Set the `GOOGLE_CLOUD_PROJECT` environment variable to your project ID.
# 3. Set the `GOOGLE_CLOUD_LOCATION` environment variable to your desired location (e.g., "us-central1").
# Example usage:
# GOOGLE_CLOUD_PROJECT='your-project-id' GOOGLE_CLOUD_LOCATION='us-central1' python your_script_name.py
if __name__ == "__main__":
    project_id = os.environ.get("GOOGLE_CLOUD_PROJECT", "")
    location = os.environ.get("GOOGLE_CLOUD_LOCATION", "")
    if not project_id:
        print("Please set the GOOGLE_CLOUD_PROJECT environment variable.")
    elif not location:
        print("Please set the GOOGLE_CLOUD_LOCATION environment variable.")
    else:
        list_metastore_services(project_id, location)
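
The quickstart lists every service under a location. To look up a single service, the API expects its full resource name, which extends the `parent` format with a `services/<id>` segment. The sketch below is illustrative: `service_name`, `parse_service_name`, and `describe_service` are hypothetical helper names, and the example assumes the client's `get_service` method and the `endpoint_uri` field of the returned service.

```python
from typing import Dict


def service_name(project_id: str, location: str, service_id: str) -> str:
    """Build the full resource name for one Dataproc Metastore service."""
    return f"projects/{project_id}/locations/{location}/services/{service_id}"


def parse_service_name(name: str) -> Dict[str, str]:
    """Split a full service resource name back into its components."""
    parts = name.split("/")
    # Expect exactly: projects/<p>/locations/<l>/services/<s>
    if len(parts) != 6 or parts[0::2] != ["projects", "locations", "services"]:
        raise ValueError(f"Not a Dataproc Metastore service name: {name!r}")
    return {"project": parts[1], "location": parts[3], "service": parts[5]}


def describe_service(project_id: str, location: str, service_id: str) -> None:
    """Fetch one service and print its state and endpoint URI."""
    # Imported here so the pure helpers above work without the client library.
    from google.cloud.metastore_v1.services.dataproc_metastore import (
        DataprocMetastoreClient,
    )

    client = DataprocMetastoreClient()
    service = client.get_service(
        name=service_name(project_id, location, service_id)
    )
    print(f"{service.name}: state={service.state.name}, endpoint={service.endpoint_uri}")
```

Calling `describe_service("your-project-id", "us-central1", "your-service-id")` requires the same credentials and environment as the quickstart. The two name helpers are pure string functions and can be used on the `service.name` values printed by the listing code.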