Google Cloud Dataproc Client Library

5.25.0 · verified Tue May 12 · auth: no · python install: verified · quickstart: verified

The Google Cloud Dataproc client library for Python provides programmatic access to Google Cloud's fully managed service for Apache Spark and Apache Hadoop. It simplifies running open-source data processing frameworks without the operational overhead of manual cluster provisioning and monitoring. The library is actively maintained with frequent updates, aligning with the rapid development of the underlying Dataproc service.

pip install google-cloud-dataproc
error ModuleNotFoundError: No module named 'your_module'
cause When running PySpark jobs on Dataproc, custom Python modules or required Google Cloud client libraries are not correctly packaged or installed on the cluster, or the Python path is not configured to include them. This also includes specific imports like `from google.protobuf.empty_pb2 import Empty` that may fail if the `protobuf` library has installation issues or version conflicts.
fix
For custom Python modules, package them into a .zip file and submit the job with gcloud dataproc jobs submit pyspark using the --py-files flag (e.g., --py-files=gs://BUCKET/Python_Proj.zip). If using the Cloud Console, provide the .zip file in the 'Archive files' field. For Google Cloud client libraries and other pip-installable dependencies, install them on the cluster itself: set the dataproc:pip.packages cluster property at creation time (e.g., --properties='dataproc:pip.packages=google-cloud-secret-manager==2.16.0'; versions must be pinned) or use an initialization action that runs pip install on each node. If `from google.protobuf.empty_pb2 import Empty` fails, reinstall the protobuf library to resolve a version conflict or an incomplete installation.
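A minimal packaging sketch (the bucket name, object path, and local project directory are placeholders, and the separate google-cloud-storage library is assumed to be installed):

import shutil
from google.cloud import storage

# Bundle the local package directory into dependencies.zip.
archive_path = shutil.make_archive("dependencies", "zip", root_dir="my_project")

# Upload the archive to GCS so the cluster can fetch it at job start.
bucket = storage.Client().bucket("your-bucket")  # placeholder bucket name
bucket.blob("deps/dependencies.zip").upload_from_filename(archive_path)

# Reference gs://your-bucket/deps/dependencies.zip via --py-files (gcloud)
# or the python_file_uris field when submitting through the client library.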
error google.api_core.exceptions.InvalidArgument: 400 Cluster name is required
cause This error occurs when a Dataproc API call (e.g., submitting a job) is missing the `clusterName` parameter within the `JobPlacement` configuration of the job request.
fix
Ensure that the job configuration dictionary explicitly defines the cluster name within the placement field, for example: job = {'placement': {'cluster_name': 'your-cluster-name'}, ...}.
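A minimal submission sketch through the client library, with placeholder project ID, region, cluster name, and GCS paths:

from google.cloud.dataproc_v1 import JobControllerClient

region = "us-central1"  # placeholder region
client = JobControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

job = {
    # Required: omitting placement.cluster_name is what triggers
    # "400 Cluster name is required".
    "placement": {"cluster_name": "your-cluster-name"},
    "pyspark_job": {
        "main_python_file_uri": "gs://your-bucket/main.py",  # placeholder
        "python_file_uris": ["gs://your-bucket/deps/dependencies.zip"],
    },
}

result = client.submit_job(project_id="your-project-id", region=region, job=job)
print(f"Submitted job {result.reference.job_id}")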
error User does not have permission to...
cause This broad category of errors, often accompanied by a '403 Forbidden' status, indicates that the service account or user attempting to perform a Dataproc operation (such as creating a cluster, submitting a job, or accessing GCS resources) lacks the necessary Identity and Access Management (IAM) roles or permissions. Common specific issues include missing `compute.subnetworks.use` for shared VPCs, `storage.objectViewer` or `storage.objectCreator` for GCS buckets, or `dataproc.agent.serviceAgent` roles.
fix
Grant the required IAM roles to the service account or user in the Google Cloud project. Review the Dataproc documentation for the specific permissions needed for your operation. For cluster creation, roles like roles/dataproc.editor and roles/compute.networkUser (if using shared VPC) are often necessary. For accessing GCS, roles/storage.objectViewer and roles/storage.objectCreator on the relevant buckets are typically needed.
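One way to check GCS access from Python before digging into IAM bindings is the storage library's test_iam_permissions helper, which returns the subset of permissions the caller actually holds (the bucket name below is a placeholder):

from google.cloud import storage

# Ask GCS which of these permissions the current credentials hold on the bucket.
bucket = storage.Client().bucket("your-bucket")
granted = bucket.test_iam_permissions(
    ["storage.objects.get", "storage.objects.create"]
)
print(f"Granted permissions: {granted}")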
error google.api_core.exceptions.NotFound: 404 Can not copy from "gs://.../dependencies.tar.gz"
cause This error signifies that a specified Google Cloud Storage (GCS) resource, such as a script, a dependency archive, or input/output data, could not be found or accessed by the Dataproc service. This can be due to an incorrect GCS path, the file not existing, or the Dataproc service account lacking read permissions to the GCS object.
fix
Verify that the GCS path provided in your job submission or cluster configuration is correct and that the object actually exists in the specified bucket. Also ensure that the Dataproc service account has the roles/storage.objectViewer role (or equivalent read permissions) on the bucket and its contents so the cluster can read the necessary files.
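A quick pre-flight check with the google-cloud-storage library (bucket and object names are placeholders) can rule out a missing object before the job is submitted:

from google.cloud import storage

# Fail fast if the dependency archive is missing from the bucket.
blob = storage.Client().bucket("your-bucket").blob("deps/dependencies.tar.gz")
if not blob.exists():
    raise FileNotFoundError(
        "gs://your-bucket/deps/dependencies.tar.gz does not exist"
    )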
breaking Dataproc prevents the creation of clusters with image versions prior to 2.0.27 due to Apache Log4j security vulnerabilities. Users should always create clusters with the latest available sub-minor image versions to ensure security and receive support.
fix Specify a Dataproc image version of 2.0.27 or newer when creating clusters. Always use the latest recommended sub-minor image version.
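A sketch of pinning the image at cluster creation; the project, region, cluster name, and exact image track are placeholders (check the current image list), and any version at or above 2.0.27 satisfies the constraint:

from google.cloud.dataproc_v1 import ClusterControllerClient

region = "us-central1"  # placeholder region
client = ClusterControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

cluster = {
    "project_id": "your-project-id",      # placeholder
    "cluster_name": "your-cluster-name",  # placeholder
    "config": {
        # Pin a patched image track; example value, verify against current docs.
        "software_config": {"image_version": "2.2-debian12"},
    },
}

operation = client.create_cluster(
    project_id="your-project-id", region=region, cluster=cluster
)
created = operation.result()  # create_cluster is a long-running operation
print(f"Created cluster: {created.cluster_name}")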
gotcha For local development, prefer service account credentials or Application Default Credentials (via `gcloud auth application-default login`) over end-user account credentials, especially for code headed to production.
fix Configure Application Default Credentials by running `gcloud auth application-default login` or set `GOOGLE_APPLICATION_CREDENTIALS` to a service account key file for local development.
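A small check that Application Default Credentials resolve as expected:

import google.auth

# Resolves credentials from GOOGLE_APPLICATION_CREDENTIALS or the
# `gcloud auth application-default login` cache, in that order of precedence.
credentials, project_id = google.auth.default()
print(f"Application Default Credentials resolved for project: {project_id}")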
gotcha When submitting PySpark jobs to Dataproc, Python packages installed locally (e.g., via `pip install`) on the cluster VMs might not be found by PySpark if the environment is not correctly configured. This often leads to `ImportError`.
fix Ensure the PySpark environment uses the correct Python interpreter. This can involve setting `PYSPARK_PYTHON`, using initialization actions to install packages into a specific Conda environment, or packaging dependencies correctly with your job submission using `--py-files` or `--jars`.
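One way to pin the interpreter per job is through Spark properties in the job config. The interpreter path below assumes the default Conda environment on recent Dataproc images and should be adjusted for your cluster; other names are placeholders:

# Point PySpark at the interpreter whose site-packages contain your dependencies.
job = {
    "placement": {"cluster_name": "your-cluster-name"},  # placeholder
    "pyspark_job": {
        "main_python_file_uri": "gs://your-bucket/main.py",  # placeholder
        # spark.pyspark.python selects the Python executable on the workers.
        "properties": {"spark.pyspark.python": "/opt/conda/default/bin/python"},
    },
}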
gotcha The `google-cloud-dataproc` library itself requires Python >= 3.7. Attempting to use it with older Python versions will result in installation or runtime errors.
fix Ensure your Python environment is version 3.7 or newer.
deprecated Dataproc Serverless for Apache Spark is now generally available directly within BigQuery. While the Dataproc client library can still manage serverless batches, new development might prefer the BigQuery integration for Spark workloads.
fix Consider leveraging the BigQuery integration for Serverless Spark workloads for potentially simplified development and deployment, especially if already using BigQuery.
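For reference, a serverless batch submitted through the client library (all names and paths below are placeholders) looks roughly like this:

from google.cloud.dataproc_v1 import BatchControllerClient

region = "us-central1"  # placeholder region
client = BatchControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

batch = {
    "pyspark_batch": {"main_python_file_uri": "gs://your-bucket/main.py"},
}

operation = client.create_batch(
    parent=f"projects/your-project-id/locations/{region}",
    batch=batch,
    batch_id="example-batch-001",  # placeholder; must be unique in the location
)
result = operation.result()  # waits for the batch to finish
print(f"Batch finished with state: {result.state.name}")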
python  os / libc      status  wheel  install  disk
3.9     alpine (musl)  wheel   -      2.09s    73.8M
3.9     alpine (musl)  -       -      2.08s    72.7M
3.9     slim (glibc)   wheel   7.5s   1.43s    72M
3.9     slim (glibc)   -       -      1.22s    70M
3.10    alpine (musl)  wheel   -      2.34s    73.7M
3.10    alpine (musl)  -       -      2.26s    72.6M
3.10    slim (glibc)   wheel   6.5s   1.14s    71M
3.10    slim (glibc)   -       -      1.12s    70M
3.11    alpine (musl)  wheel   -      2.83s    79.0M
3.11    alpine (musl)  -       -      3.23s    77.9M
3.11    slim (glibc)   wheel   5.5s   1.84s    77M
3.11    slim (glibc)   -       -      1.78s    76M
3.12    alpine (musl)  wheel   -      2.84s    70.4M
3.12    alpine (musl)  -       -      3.15s    69.3M
3.12    slim (glibc)   wheel   4.5s   2.00s    68M
3.12    slim (glibc)   -       -      2.23s    67M
3.13    alpine (musl)  wheel   -      2.66s    70.0M
3.13    alpine (musl)  -       -      3.07s    68.8M
3.13    slim (glibc)   wheel   4.7s   1.96s    68M
3.13    slim (glibc)   -       -      2.42s    67M

This quickstart initializes the Dataproc `ClusterControllerClient` and attempts to list existing clusters in a specified Google Cloud project and region. It demonstrates basic client setup using Application Default Credentials, which is the recommended authentication method. Users should set `GOOGLE_CLOUD_PROJECT` and `GOOGLE_CLOUD_REGION` environment variables or explicitly provide their project ID and desired region.

import os
from google.cloud.dataproc_v1 import ClusterControllerClient

# Set your Google Cloud Project ID and Region
project_id = os.environ.get('GOOGLE_CLOUD_PROJECT', 'your-project-id')
region = os.environ.get('GOOGLE_CLOUD_REGION', 'us-central1')

try:
    # Initialize the client. This typically uses Application Default Credentials.
    # For local development, ensure `gcloud auth application-default login` has been run.
    # The API endpoint is regional, e.g., 'us-central1-dataproc.googleapis.com:443'
    client = ClusterControllerClient(client_options={
        "api_endpoint": f"{region}-dataproc.googleapis.com:443"
    })

    print(f"Successfully initialized Dataproc ClusterControllerClient for project: {project_id} in region: {region}")

    # Example: List clusters (requires Dataproc API enabled and appropriate IAM permissions)
    print(f"Listing clusters in project {project_id} and region {region}:")
    for cluster in client.list_clusters(project_id=project_id, region=region):
        print(f"- {cluster.cluster_name} (Status: {cluster.status.state.name})")

except Exception as e:
    print(f"An error occurred: {e}")
    print("Please ensure GOOGLE_CLOUD_PROJECT and GOOGLE_CLOUD_REGION environment variables are set,")
    print("or replace 'your-project-id' and 'us-central1' with actual values.")
    print("Also, verify that the Dataproc API is enabled and you have necessary IAM permissions.")