Google Cloud Dataflow Client

Version 0.11.0, verified Tue May 12.

The `google-cloud-dataflow-client` is the Python client library for interacting with the Google Cloud Dataflow API. It allows users to programmatically manage Dataflow jobs, such as listing existing jobs, getting job details, or submitting new ones based on templates. This library provides an interface to the Dataflow service API, distinct from the `apache-beam` SDK, which is used for defining Dataflow pipelines themselves. It is part of the broader `google-cloud-python` ecosystem and generally follows its release cadence, receiving regular updates for bug fixes and new features.

pip install google-cloud-dataflow-client
error Permission denied
cause The user account or the Dataflow service account (worker service account) lacks the necessary IAM permissions to perform the requested operation, such as creating a job, accessing Cloud Storage buckets, or writing to BigQuery.
fix
Ensure the user submitting the job has the 'Dataflow Developer' role, and that the worker service account (typically '<project-number>-compute@developer.gserviceaccount.com', or a custom one) has at least the 'Dataflow Worker' and 'Storage Object Admin' roles, plus roles for any other resources your pipeline touches (for example, 'BigQuery Data Editor' when writing to BigQuery).
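
If you want to check this from code rather than the IAM console, a minimal sketch using the Resource Manager testIamPermissions call is below. It assumes the optional `google-cloud-resource-manager` package is installed and Application Default Credentials are configured; the permission names checked are only illustrative.

# Sketch: verify the caller's effective Dataflow permissions on a project.
# Assumes `pip install google-cloud-resource-manager`; project ID is a placeholder.
from google.cloud import resourcemanager_v3

project_id = "your-gcp-project-id"
needed = ["dataflow.jobs.create", "dataflow.jobs.get", "dataflow.jobs.list"]

client = resourcemanager_v3.ProjectsClient()
response = client.test_iam_permissions(
    request={"resource": f"projects/{project_id}", "permissions": needed}
)

granted = set(response.permissions)  # the subset of `needed` the caller actually holds
for permission in needed:
    print(f"{permission}: {'OK' if permission in granted else 'MISSING'}")
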
error AttributeError: module 'google.cloud' has no attribute 'storage'
cause This error typically occurs when an Apache Beam pipeline running on Dataflow attempts to use a `google-cloud-*` client library (like `google-cloud-storage` in this case) but the corresponding Python package is not installed or correctly made available in the Dataflow worker's environment. This is common when dependencies are not properly declared or staged.
fix
Declare all external Python package dependencies in a requirements.txt file and provide it to your Dataflow job using the --requirements_file pipeline option. For more complex projects, use a setup.py file with install_requires to manage dependencies.
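
A minimal sketch of the --requirements_file option, which lives on the `apache-beam` side rather than in this client library; the project, bucket, and requirements.txt contents are placeholders, and running it as-is would submit a trivial Dataflow job.

# Sketch: staging worker dependencies with apache-beam pipeline options.
# Assumes `pip install 'apache-beam[gcp]'` and a requirements.txt listing the
# packages your pipeline imports (e.g., google-cloud-storage).
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(
    runner="DataflowRunner",
    project="your-gcp-project-id",          # placeholder
    region="us-central1",
    temp_location="gs://your-bucket/temp",  # placeholder bucket
    requirements_file="requirements.txt",   # workers pip-install these at startup
)

with beam.Pipeline(options=options) as pipeline:
    # Pipeline definition goes here; workers can now import the staged packages.
    pipeline | beam.Create(["hello"]) | beam.Map(print)
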
error Some Cloud APIs need to be enabled for your project in order for Cloud Dataflow to run this job.
cause The Google Cloud project where you are trying to run the Dataflow job has one or more required APIs disabled. Dataflow jobs depend on several other Google Cloud services, such as Compute Engine, Cloud Storage, Cloud Logging, and BigQuery.
fix
Enable all necessary Google Cloud APIs for your project, including Dataflow API, Compute Engine API, Cloud Storage API, Cloud Logging API, and any others specific to your pipeline (e.g., BigQuery API, Pub/Sub API, Datastore API). This can be done via the Google Cloud Console or gcloud services enable commands.
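
If you prefer to check this from Python instead of the Console or gcloud, a rough sketch against the Service Usage API follows; it assumes the optional `google-cloud-service-usage` package and uses a placeholder project ID.

# Sketch: report whether the APIs Dataflow depends on are enabled for a project.
# Assumes `pip install google-cloud-service-usage` and Application Default Credentials.
from google.cloud import service_usage_v1

project_id = "your-gcp-project-id"
required = [
    "dataflow.googleapis.com",
    "compute.googleapis.com",
    "storage.googleapis.com",
    "logging.googleapis.com",
]

client = service_usage_v1.ServiceUsageClient()
for service in required:
    svc = client.get_service(
        request={"name": f"projects/{project_id}/services/{service}"}
    )
    print(f"{service}: {svc.state.name}")  # ENABLED or DISABLED

Anything reported as DISABLED can then be enabled via the Console or with gcloud services enable, as described above.
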
error Failed to start the VM, launcher-... because of status code: UNAVAILABLE, reason: One or more operations had an error: '...': [UNAVAILABLE] 'HTTP_503'.
cause This indicates that Dataflow workers (Compute Engine VMs) could not be started or initialized. Common reasons include region-specific resource exhaustion, incorrect network configuration (firewall rules, VPC settings, Private Google Access), or exceeding Compute Engine metadata limits for pipeline options.
fix
Troubleshoot by checking the job logs for more specific errors. Verify firewall rules and network configurations, ensure private IP access is correctly set up if external IPs are disabled, check for regional resource availability, and ensure the pipeline's JSON request size does not exceed Compute Engine metadata limits.
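
To pull those job logs programmatically with this client library, a minimal sketch using the Messages service is shown below; the project ID, region, and job ID are placeholders.

# Sketch: read warning/error messages for a job to diagnose worker startup failures.
from google.cloud import dataflow_v1beta3

client = dataflow_v1beta3.MessagesV1Beta3Client()
request = dataflow_v1beta3.ListJobMessagesRequest(
    project_id="your-gcp-project-id",   # placeholder
    location="us-central1",             # region where the job ran
    job_id="your-job-id",               # placeholder
    minimum_importance=dataflow_v1beta3.JobMessageImportance.JOB_MESSAGE_WARNING,
)

# The pager yields JobMessage items across pages.
for message in client.list_job_messages(request=request):
    print(message.time, message.message_importance, message.message_text)
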
gotcha This library (`google-cloud-dataflow-client`) is for *managing* Dataflow jobs and interacting with the Dataflow service. It is not for *defining* data processing pipelines. Pipeline definition is done using the Apache Beam SDK (`apache-beam` library).
fix Use `apache-beam` for writing and defining your Dataflow pipelines. Use `google-cloud-dataflow-client` for programmatic control over the Dataflow service itself (e.g., listing, launching, or updating jobs).
gotcha Google Cloud client libraries often expose API versions (e.g., `v1beta3`) directly in their import paths. Using an incorrect or deprecated API version in your import statement (e.g., `from google.cloud.dataflow_v1` instead of `dataflow_v1beta3`) can lead to `ImportError` or unexpected API behavior.
fix Always refer to the official documentation or the library's source code for the exact and recommended import paths for the specific API version you intend to use. For `google-cloud-dataflow-client` version `0.11.0`, the primary API surface is `dataflow_v1beta3`.
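
For reference, a small sketch of imports that match the `v1beta3` surface shipped in 0.11.0 (it assumes Application Default Credentials are available when the clients are constructed):

# The v1beta3 GAPIC clients exposed by google-cloud-dataflow-client.
from google.cloud import dataflow_v1beta3

jobs_client = dataflow_v1beta3.JobsV1Beta3Client()            # manage jobs
templates_client = dataflow_v1beta3.TemplatesServiceClient()  # launch classic templates
flex_client = dataflow_v1beta3.FlexTemplatesServiceClient()   # launch Flex Templates
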
gotcha Dataflow operations are regional. When interacting with the Dataflow service, it is crucial to specify the correct `location` (region) for listing or creating jobs. Failing to do so may result in not finding expected jobs or encountering errors.
fix Ensure that the `location` parameter is explicitly provided in your client calls (e.g., `client.list_jobs(project_id=project_id, location='us-central1')`) and matches the region where your Dataflow jobs are deployed or expected to run.
deprecated The Dataflow API `v1beta3` is a beta release and, as such, its methods and types are subject to backward-incompatible changes without a long deprecation period. While generally stable in practice, treat beta surfaces with caution in production environments.
fix Monitor official Google Cloud Dataflow documentation and release notes for updates to the API and client library. Be prepared to adapt your code if `v1beta3` features are modified or a new, more stable API version becomes available and is recommended.
python  os / libc      status  wheel install  import  disk
3.9     alpine (musl)  wheel   -              -       70.0M
3.9     alpine (musl)  -       -              -       -
3.9     slim (glibc)   wheel   7.1s           -       68M
3.9     slim (glibc)   -       -              -       -
3.10    alpine (musl)  wheel   -              -       70.0M
3.10    alpine (musl)  -       -              -       -
3.10    slim (glibc)   wheel   6.3s           -       68M
3.10    slim (glibc)   -       -              -       -
3.11    alpine (musl)  wheel   -              -       74.7M
3.11    alpine (musl)  -       -              -       -
3.11    slim (glibc)   wheel   5.2s           -       73M
3.11    slim (glibc)   -       -              -       -
3.12    alpine (musl)  wheel   -              -       66.2M
3.12    alpine (musl)  -       -              -       -
3.12    slim (glibc)   wheel   4.5s           -       64M
3.12    slim (glibc)   -       -              -       -
3.13    alpine (musl)  wheel   -              -       65.9M
3.13    alpine (musl)  -       -              -       -
3.13    slim (glibc)   wheel   4.4s           -       64M
3.13    slim (glibc)   -       -              -       -

This quickstart demonstrates how to initialize the Dataflow Jobs client and list existing Dataflow jobs in a specified Google Cloud project and region. Ensure your `GOOGLE_CLOUD_PROJECT` environment variable is set and you have authenticated via `gcloud auth application-default login`.

import os
from google.cloud.dataflow_v1beta3 import JobsV1Beta3Client
from google.cloud.dataflow_v1beta3.types import ListJobsRequest

# Set your Google Cloud Project ID (e.g., via GOOGLE_CLOUD_PROJECT environment variable)
project_id = os.environ.get("GOOGLE_CLOUD_PROJECT", "your-gcp-project-id")
region = "us-central1" # Dataflow jobs are regional, specify the region where your jobs run

if project_id == "your-gcp-project-id":
    print("WARNING: Please set the GOOGLE_CLOUD_PROJECT environment variable or replace 'your-gcp-project-id'.")
    exit()

try:
    # Initialize the client (will use Application Default Credentials by default)
    client = JobsV1Beta3Client()

    # Create a request to list jobs in a specific project and region
    request = ListJobsRequest(project_id=project_id, location=region)

    print(f"Listing Dataflow jobs for project '{project_id}' in region '{region}':")
    response = client.list_jobs(request=request)

    jobs_found = False
    for job in response:  # iterate the pager directly; it transparently pages through results
        print(f"  Job ID: {job.id}, Name: {job.name}, Type: {job.type_}, State: {job.current_state}")
        jobs_found = True

    if not jobs_found:
        print("  No Dataflow jobs found.")

except Exception as e:
    print(f"An error occurred: {e}")
    print("Ensure you have authenticated with `gcloud auth application-default login` and enabled the Dataflow API in your project.")