{"id":764,"library":"google-cloud-dataproc","title":"Google Cloud Dataproc Client Library","description":"The Google Cloud Dataproc client library for Python provides programmatic access to Google Cloud's fully managed service for Apache Spark and Apache Hadoop. It simplifies running open-source data processing frameworks without the operational overhead of manual cluster provisioning and monitoring. The library is actively maintained with frequent updates, aligning with the rapid development of the underlying Dataproc service.","status":"active","version":"5.25.0","language":"python","source_language":"en","source_url":"https://github.com/googleapis/google-cloud-python/tree/main/packages/google-cloud-dataproc","tags":["google-cloud","dataproc","big-data","spark","hadoop","etl"],"install":[{"cmd":"pip install google-cloud-dataproc","lang":"bash","label":"Install latest version"}],"dependencies":[{"reason":"Core library for Google API clients, handling common functionality such as API requests, retries, and error handling.","package":"google-api-core","optional":false},{"reason":"Handles authentication with Google Cloud services, providing credentials management.","package":"google-auth","optional":false}],"imports":[{"note":"The primary client for managing Dataproc clusters. Other clients like JobControllerClient, WorkflowTemplateServiceClient, and BatchControllerClient are available in 'google.cloud.dataproc_v1.services'.","symbol":"DataprocClient","correct":"from google.cloud.dataproc_v1 import ClusterControllerClient"}],"quickstart":{"code":"import os\nfrom google.cloud.dataproc_v1 import ClusterControllerClient\n\n# Set your Google Cloud Project ID and Region\nproject_id = os.environ.get('GOOGLE_CLOUD_PROJECT', 'your-project-id')\nregion = os.environ.get('GOOGLE_CLOUD_REGION', 'us-central1')\n\ntry:\n    # Initialize the client. This typically uses Application Default Credentials.\n    # For local development, ensure `gcloud auth application-default login` has been run.\n    # The API endpoint is regional, e.g., 'us-central1-dataproc.googleapis.com:443'\n    client = ClusterControllerClient(client_options={\n        \"api_endpoint\": f\"{region}-dataproc.googleapis.com:443\"\n    })\n\n    print(f\"Successfully initialized Dataproc ClusterControllerClient for project: {project_id} in region: {region}\")\n\n    # Example: List clusters (requires the Dataproc API to be enabled and appropriate IAM permissions)\n    print(f\"Listing clusters in project {project_id} and region {region}:\")\n    for cluster in client.list_clusters(project_id=project_id, region=region):\n        print(f\"- {cluster.cluster_name} (Status: {cluster.status.state.name})\")\n\nexcept Exception as e:\n    print(f\"An error occurred: {e}\")\n    print(\"Please ensure GOOGLE_CLOUD_PROJECT and GOOGLE_CLOUD_REGION environment variables are set,\")\n    print(\"or replace 'your-project-id' and 'us-central1' with actual values.\")\n    print(\"Also, verify that the Dataproc API is enabled and you have necessary IAM permissions.\")","lang":"python","description":"This quickstart initializes the Dataproc `ClusterControllerClient` and attempts to list existing clusters in a specified Google Cloud project and region. It demonstrates basic client setup using Application Default Credentials, which is the recommended authentication method. Users should set `GOOGLE_CLOUD_PROJECT` and `GOOGLE_CLOUD_REGION` environment variables or explicitly provide their project ID and desired region."},"warnings":[{"fix":"Specify a Dataproc image version of 2.0.27 or newer when creating clusters. Always use the latest recommended sub-minor image version.","message":"Dataproc prevents the creation of clusters with image versions prior to 2.0.27 due to Apache Log4j security vulnerabilities. Users should always create clusters with the latest available sub-minor image versions to ensure security and receive support.","severity":"breaking","affected_versions":"< 2.0.27 (Dataproc image versions, not client library)"},{"fix":"Configure Application Default Credentials by running `gcloud auth application-default login` or set `GOOGLE_APPLICATION_CREDENTIALS` to a service account key file for local development.","message":"Use Application Default Credentials (via `gcloud auth application-default login`) for local development and service account credentials for production environments, rather than user account credentials.","severity":"gotcha","affected_versions":"All versions"},{"fix":"Ensure the PySpark environment uses the correct Python interpreter. This can involve setting `PYSPARK_PYTHON`, using initialization actions to install packages into a specific Conda environment, or packaging dependencies correctly with your job submission using `--py-files` or `--jars`.","message":"When submitting PySpark jobs to Dataproc, Python packages installed locally (e.g., via `pip install`) on the cluster VMs might not be found by PySpark if the environment is not correctly configured. This often leads to `ImportError`.","severity":"gotcha","affected_versions":"All versions"},{"fix":"Ensure your Python environment is version 3.7 or newer.","message":"The `google-cloud-dataproc` library itself requires Python >= 3.7. Attempting to use it with older Python versions will result in installation or runtime errors.","severity":"gotcha","affected_versions":"< 3.7"},{"fix":"Consider leveraging the BigQuery integration for Serverless Spark workloads for potentially simplified development and deployment, especially if already using BigQuery.","message":"Serverless for Apache Spark (previously known as Dataproc Serverless) is now generally available directly within BigQuery. While the Dataproc client library can still manage serverless batches, new development might prefer the BigQuery integration for Spark workloads.","severity":"deprecated","affected_versions":"N/A (strategic shift)"}],"env_vars":null,"last_verified":"2026-05-12T18:46:48.673Z","next_check":"2026-06-27T00:00:00.000Z","problems":[{"fix":"For custom Python modules, package them into a `.zip` file and submit the job using the `gcloud dataproc jobs submit pyspark` command with the `--py-files` flag (e.g., `--py-files=gs://BUCKET/Python_Proj.zip`). If using the Cloud Console, provide the `.zip` file in the 'Archive files' field. For Google Cloud client libraries, ensure they are installed on the cluster, for example via an initialization action that runs `pip install` on the cluster nodes. If encountering issues with `google.protobuf.empty_pb2`, try reinstalling the `protobuf` library to resolve potential version conflicts or incomplete installations.","cause":"When running PySpark jobs on Dataproc, custom Python modules or required Google Cloud client libraries are not correctly packaged or installed on the cluster, or the Python path is not configured to include them. This also includes specific imports like `from google.protobuf.empty_pb2 import Empty` that may fail if the `protobuf` library has installation issues or version conflicts.","error":"ModuleNotFoundError: No module named 'your_module'"},{"fix":"Ensure that the job configuration dictionary explicitly defines the cluster name within the `placement` field. For example, your job dictionary should include `job = {'placement': {'cluster_name': 'your-cluster-name'}, ...}`.","cause":"This error occurs when a Dataproc API call (e.g., submitting a job) is missing the `clusterName` parameter within the `JobPlacement` configuration of the job request.","error":"google.api_core.exceptions.InvalidArgument: 400 Cluster name is required"},{"fix":"Grant the required IAM roles to the service account or user in the Google Cloud project. Review the Dataproc documentation for the specific permissions needed for your operation. For cluster creation, roles like `roles/dataproc.editor` and `roles/compute.networkUser` (if using shared VPC) are often necessary. For accessing GCS, `roles/storage.objectViewer` and `roles/storage.objectCreator` on the relevant buckets are typically needed.","cause":"This broad category of errors, often accompanied by a '403 Forbidden' status, indicates that the service account or user attempting to perform a Dataproc operation (such as creating a cluster, submitting a job, or accessing GCS resources) lacks the necessary Identity and Access Management (IAM) roles or permissions. Common specific issues include missing `compute.subnetworks.use` for shared VPCs, `storage.objectViewer` or `storage.objectCreator` for GCS buckets, or the `roles/dataproc.serviceAgent` role.","error":"User does not have permission to..."},{"fix":"Verify that the GCS path provided in your job submission or cluster configuration is correct and that the object (file or directory) actually exists in the specified bucket. Additionally, ensure that the Dataproc service account has `storage.objectViewer` permission on the GCS bucket and its contents to allow the cluster to read the necessary files.","cause":"This error signifies that a specified Google Cloud Storage (GCS) resource, such as a script, a dependency archive, or input/output data, could not be found or accessed by the Dataproc service. This can be due to an incorrect GCS path, the file not existing, or the Dataproc service account lacking read permissions to the GCS object.","error":"google.api_core.exceptions.NotFound: 404 Can not copy from \"gs://.../dependencies.tar.gz\""}],"ecosystem":"pypi","meta_description":null,"install_score":95,"install_tag":"verified","quickstart_score":80,"quickstart_tag":"verified","pypi_latest":"5.27.0","cli_name":"","install_checks":{"last_tested":"2026-05-12","tag":"verified","tag_description":"installs cleanly on critical runtimes, fast import, recently tested","results":[{"runtime":"python:3.10-alpine","python_version":"3.10","os_libc":"alpine (musl)","variant":" $EXIT -eq 0 ","exit_code":0,"wheel_type":"wheel","failure_reason":null,"install_time_s":null,"import_time_s":2.34,"mem_mb":35.1,"disk_size":"73.7M"},{"runtime":"python:3.10-alpine","python_version":"3.10","os_libc":"alpine (musl)","variant":"default","exit_code":0,"wheel_type":null,"failure_reason":null,"install_time_s":null,"import_time_s":2.26,"mem_mb":35.2,"disk_size":"72.6M"},{"runtime":"python:3.10-slim","python_version":"3.10","os_libc":"slim (glibc)","variant":" $EXIT -eq 0 ","exit_code":0,"wheel_type":"wheel","failure_reason":null,"install_time_s":6.5,"import_time_s":1.14,"mem_mb":26.3,"disk_size":"71M"},{"runtime":"python:3.10-slim","python_version":"3.10","os_libc":"slim (glibc)","variant":"default","exit_code":0,"wheel_type":null,"failure_reason":null,"install_time_s":null,"import_time_s":1.12,"mem_mb":26.5,"disk_size":"70M"},{"runtime":"python:3.11-alpine","python_version":"3.11","os_libc":"alpine (musl)","variant":" $EXIT -eq 0 ","exit_code":0,"wheel_type":"wheel","failure_reason":null,"install_time_s":null,"import_time_s":2.83,"mem_mb":37,"disk_size":"79.0M"},{"runtime":"python:3.11-alpine","python_version":"3.11","os_libc":"alpine (musl)","variant":"default","exit_code":0,"wheel_type":null,"failure_reason":null,"install_time_s":null,"import_time_s":3.23,"mem_mb":37.3,"disk_size":"77.9M"},{"runtime":"python:3.11-slim","python_version":"3.11","os_libc":"slim (glibc)","variant":" $EXIT -eq 0 ","exit_code":0,"wheel_type":"wheel","failure_reason":null,"install_time_s":5.5,"import_time_s":1.84,"mem_mb":29,"disk_size":"77M"},{"runtime":"python:3.11-slim","python_version":"3.11","os_libc":"slim (glibc)","variant":"default","exit_code":0,"wheel_type":null,"failure_reason":null,"install_time_s":null,"import_time_s":1.78,"mem_mb":29.3,"disk_size":"76M"},{"runtime":"python:3.12-alpine","python_version":"3.12","os_libc":"alpine (musl)","variant":" $EXIT -eq 0 ","exit_code":0,"wheel_type":"wheel","failure_reason":null,"install_time_s":null,"import_time_s":2.84,"mem_mb":36.6,"disk_size":"70.4M"},{"runtime":"python:3.12-alpine","python_version":"3.12","os_libc":"alpine (musl)","variant":"default","exit_code":0,"wheel_type":null,"failure_reason":null,"install_time_s":null,"import_time_s":3.15,"mem_mb":36.9,"disk_size":"69.3M"},{"runtime":"python:3.12-slim","python_version":"3.12","os_libc":"slim (glibc)","variant":" $EXIT -eq 0 ","exit_code":0,"wheel_type":"wheel","failure_reason":null,"install_time_s":4.5,"import_time_s":2,"mem_mb":28.8,"disk_size":"68M"},{"runtime":"python:3.12-slim","python_version":"3.12","os_libc":"slim (glibc)","variant":"default","exit_code":0,"wheel_type":null,"failure_reason":null,"install_time_s":null,"import_time_s":2.23,"mem_mb":29.1,"disk_size":"67M"},{"runtime":"python:3.13-alpine","python_version":"3.13","os_libc":"alpine (musl)","variant":" $EXIT -eq 0 ","exit_code":0,"wheel_type":"wheel","failure_reason":null,"install_time_s":null,"import_time_s":2.66,"mem_mb":37.1,"disk_size":"70.0M"},{"runtime":"python:3.13-alpine","python_version":"3.13","os_libc":"alpine (musl)","variant":"default","exit_code":0,"wheel_type":null,"failure_reason":null,"install_time_s":null,"import_time_s":3.07,"mem_mb":37.6,"disk_size":"68.8M"},{"runtime":"python:3.13-slim","python_version":"3.13","os_libc":"slim (glibc)","variant":" $EXIT -eq 0 ","exit_code":0,"wheel_type":"wheel","failure_reason":null,"install_time_s":4.7,"import_time_s":1.96,"mem_mb":29.1,"disk_size":"68M"},{"runtime":"python:3.13-slim","python_version":"3.13","os_libc":"slim (glibc)","variant":"default","exit_code":0,"wheel_type":null,"failure_reason":null,"install_time_s":null,"import_time_s":2.42,"mem_mb":29.7,"disk_size":"67M"},{"runtime":"python:3.9-alpine","python_version":"3.9","os_libc":"alpine (musl)","variant":" $EXIT -eq 0 ","exit_code":0,"wheel_type":"wheel","failure_reason":null,"install_time_s":null,"import_time_s":2.09,"mem_mb":35.1,"disk_size":"73.8M"},{"runtime":"python:3.9-alpine","python_version":"3.9","os_libc":"alpine (musl)","variant":"default","exit_code":0,"wheel_type":null,"failure_reason":null,"install_time_s":null,"import_time_s":2.08,"mem_mb":34.9,"disk_size":"72.7M"},{"runtime":"python:3.9-slim","python_version":"3.9","os_libc":"slim (glibc)","variant":" $EXIT -eq 0 ","exit_code":0,"wheel_type":"wheel","failure_reason":null,"install_time_s":7.5,"import_time_s":1.43,"mem_mb":26.4,"disk_size":"72M"},{"runtime":"python:3.9-slim","python_version":"3.9","os_libc":"slim (glibc)","variant":"default","exit_code":0,"wheel_type":null,"failure_reason":null,"install_time_s":null,"import_time_s":1.22,"mem_mb":26.2,"disk_size":"70M"}]},"quickstart_checks":{"last_tested":"2026-04-24","tag":"verified","tag_description":"quickstart runs on critical runtimes, recently tested","results":[{"runtime":"python:3.10-alpine","exit_code":0},{"runtime":"python:3.10-slim","exit_code":0},{"runtime":"python:3.11-alpine","exit_code":0},{"runtime":"python:3.11-slim","exit_code":0},{"runtime":"python:3.12-alpine","exit_code":0},{"runtime":"python:3.12-slim","exit_code":0},{"runtime":"python:3.13-alpine","exit_code":0},{"runtime":"python:3.13-slim","exit_code":0},{"runtime":"python:3.9-alpine","exit_code":0},{"runtime":"python:3.9-slim","exit_code":0}]}}