Google Cloud Dataproc Client Library

5.25.0 · active · verified Sun Mar 29

The Google Cloud Dataproc client library for Python (`google-cloud-dataproc`) provides programmatic access to Dataproc, Google Cloud's fully managed service for Apache Spark and Apache Hadoop. Dataproc runs these open-source data processing frameworks without the operational overhead of provisioning and monitoring clusters by hand. The library is actively maintained and released frequently to track changes in the underlying Dataproc service.

Warnings

Install

Imports

Quickstart

This quickstart creates a Dataproc `ClusterControllerClient` and lists the existing clusters in a given Google Cloud project and region. The client authenticates with Application Default Credentials, the recommended method, and connects to the regional API endpoint for the chosen region. Set the `GOOGLE_CLOUD_PROJECT` and `GOOGLE_CLOUD_REGION` environment variables, or replace the fallback values `your-project-id` and `us-central1` in the code with your own project ID and region.

import os
from google.cloud.dataproc_v1 import ClusterControllerClient

# Set your Google Cloud Project ID and Region
project_id = os.environ.get('GOOGLE_CLOUD_PROJECT', 'your-project-id')
region = os.environ.get('GOOGLE_CLOUD_REGION', 'us-central1')

try:
    # Initialize the client. This typically uses Application Default Credentials.
    # For local development, ensure `gcloud auth application-default login` has been run.
    # The API endpoint is regional, e.g., 'us-central1-dataproc.googleapis.com:443'
    client = ClusterControllerClient(client_options={
        "api_endpoint": f"{region}-dataproc.googleapis.com:443"
    })

    print(f"Successfully initialized Dataproc ClusterControllerClient for project: {project_id} in region: {region}")

    # Example: List clusters (requires Dataproc API enabled and appropriate IAM permissions)
    print(f"Listing clusters in project {project_id} and region {region}:")
    for cluster in client.list_clusters(project_id=project_id, region=region):
        print(f"- {cluster.cluster_name} (Status: {cluster.status.state.name})")

except Exception as e:
    print(f"An error occurred: {e}")
    print("Please ensure GOOGLE_CLOUD_PROJECT and GOOGLE_CLOUD_REGION environment variables are set,")
    print("or replace 'your-project-id' and 'us-central1' with actual values.")
    print("Also, verify that the Dataproc API is enabled and you have necessary IAM permissions.")
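
The quickstart builds the regional endpoint string inline. If several clients need the same endpoint, that step can be factored out, as in the minimal sketch below. The helper names `dataproc_endpoint` and `make_cluster_client` are hypothetical and not part of the library. The handling of the special `global` region (falling back to the default `dataproc.googleapis.com` endpoint) is an assumption to verify against the Dataproc documentation for your setup.

```python
import os

# Assumption: the default (non-regional) endpoint serves the "global" region.
DEFAULT_ENDPOINT = "dataproc.googleapis.com:443"


def dataproc_endpoint(region: str) -> str:
    """Return the Dataproc API endpoint for a region (hypothetical helper)."""
    region = region.strip().lower()
    if not region:
        raise ValueError("region must be a non-empty string")
    if region == "global":
        return DEFAULT_ENDPOINT
    # Same pattern the quickstart uses, e.g. us-central1-dataproc.googleapis.com:443
    return f"{region}-dataproc.googleapis.com:443"


def make_cluster_client(region: str):
    """Build a ClusterControllerClient bound to the region's endpoint."""
    # Imported lazily so dataproc_endpoint() stays usable without the package.
    from google.cloud.dataproc_v1 import ClusterControllerClient

    return ClusterControllerClient(
        client_options={"api_endpoint": dataproc_endpoint(region)}
    )


if __name__ == "__main__":
    # Only prints the endpoint; no credentials or network access are needed.
    region = os.environ.get("GOOGLE_CLOUD_REGION", "us-central1")
    print(dataproc_endpoint(region))
```

Calling `make_cluster_client(region)` would then replace the inline `ClusterControllerClient(...)` construction in the quickstart. Keeping the endpoint logic in a separate pure function makes it possible to check without credentials or network access.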
