Google Cloud Dataproc Client Library
The Google Cloud Dataproc client library for Python provides programmatic access to Google Cloud's fully managed service for Apache Spark and Apache Hadoop. It simplifies running open-source data processing frameworks without the operational overhead of manual cluster provisioning and monitoring. The library is actively maintained with frequent updates, aligning with the rapid development of the underlying Dataproc service.
Warnings
- breaking Dataproc prevents the creation of clusters with image versions prior to 2.0.27 due to Apache Log4j security vulnerabilities. Users should always create clusters with the latest available sub-minor image versions to ensure security and receive support.
- gotcha For local development and running code, it is recommended to use service account credentials or Application Default Credentials (via `gcloud auth application-default login`) rather than user account credentials, especially for production environments.
- gotcha When submitting PySpark jobs to Dataproc, Python packages installed directly on the cluster VMs (e.g., via `pip install`) may not be visible to PySpark executors if the Python environment is not correctly configured. This often results in `ImportError` at runtime.
- gotcha The `google-cloud-dataproc` library itself requires Python >= 3.7. Attempting to use it with older Python versions will result in installation or runtime errors.
- deprecated Serverless for Apache Spark (previously known as Dataproc Serverless) is now generally available directly within BigQuery. While the Dataproc client library can still manage serverless batches, new development might prefer the BigQuery integration for Spark workloads.
Install
pip install google-cloud-dataproc
Imports
- ClusterControllerClient
from google.cloud.dataproc_v1 import ClusterControllerClient
Quickstart
import os

from google.cloud.dataproc_v1 import ClusterControllerClient

# Set your Google Cloud Project ID and Region
project_id = os.environ.get('GOOGLE_CLOUD_PROJECT', 'your-project-id')
region = os.environ.get('GOOGLE_CLOUD_REGION', 'us-central1')

try:
    # Initialize the client. This typically uses Application Default Credentials.
    # For local development, ensure `gcloud auth application-default login` has been run.
    # The API endpoint is regional, e.g., 'us-central1-dataproc.googleapis.com:443'
    client = ClusterControllerClient(client_options={
        "api_endpoint": f"{region}-dataproc.googleapis.com:443"
    })
    print(f"Successfully initialized Dataproc ClusterControllerClient for project: {project_id} in region: {region}")

    # Example: List clusters (requires Dataproc API enabled and appropriate IAM permissions)
    print(f"Listing clusters in project {project_id} and region {region}:")
    for cluster in client.list_clusters(project_id=project_id, region=region):
        print(f"- {cluster.cluster_name} (Status: {cluster.status.state.name})")
except Exception as e:
    print(f"An error occurred: {e}")
    print("Please ensure GOOGLE_CLOUD_PROJECT and GOOGLE_CLOUD_REGION environment variables are set,")
    print("or replace 'your-project-id' and 'us-central1' with actual values.")
    print("Also, verify that the Dataproc API is enabled and you have the necessary IAM permissions.")