Google Cloud Dataflow Client
The `google-cloud-dataflow-client` is the Python client library for interacting with the Google Cloud Dataflow API. It allows users to programmatically manage Dataflow jobs, such as listing existing jobs, getting job details, or submitting new ones based on templates. This library provides an interface to the Dataflow service API, distinct from the `apache-beam` SDK, which is used for defining Dataflow pipelines themselves. It is part of the broader `google-cloud-python` ecosystem and generally follows its release cadence, receiving regular updates for bug fixes and new features.
Warnings
- gotcha This library (`google-cloud-dataflow-client`) is for *managing* Dataflow jobs and interacting with the Dataflow service. It is not for *defining* data processing pipelines. Pipeline definition is done using the Apache Beam SDK (`apache-beam` library).
- gotcha Google Cloud client libraries often expose API versions (e.g., `v1beta3`) directly in their import paths. Importing a nonexistent or deprecated version (e.g., `google.cloud.dataflow_v1` instead of `google.cloud.dataflow_v1beta3`) will raise an `ImportError` or lead to unexpected API behavior.
- gotcha Dataflow operations are regional. When interacting with the Dataflow service, it is crucial to specify the correct `location` (region) for listing or creating jobs. Failing to do so may result in not finding expected jobs or encountering errors.
- deprecated The Dataflow API `v1beta3` is a beta release and, as such, its methods and types are subject to backward-incompatible changes without a long deprecation period. While generally stable in practice, beta components should be relied on with caution in production environments.
Install
-
pip install google-cloud-dataflow-client
Imports
- JobsClient
from google.cloud.dataflow_v1beta3 import JobsClient
- ListJobsRequest
from google.cloud.dataflow_v1beta3.types import ListJobsRequest
Quickstart
import os

from google.cloud.dataflow_v1beta3 import JobsClient
from google.cloud.dataflow_v1beta3.types import ListJobsRequest

# Set your Google Cloud Project ID (e.g., via the GOOGLE_CLOUD_PROJECT environment variable)
project_id = os.environ.get("GOOGLE_CLOUD_PROJECT", "your-gcp-project-id")
region = "us-central1"  # Dataflow jobs are regional; specify the region where your jobs run

if project_id == "your-gcp-project-id":
    print("WARNING: Please set the GOOGLE_CLOUD_PROJECT environment variable or replace 'your-gcp-project-id'.")
    raise SystemExit(1)

try:
    # Initialize the client (uses Application Default Credentials by default)
    client = JobsClient()

    # Create a request to list jobs in a specific project and region
    request = ListJobsRequest(project_id=project_id, location=region)

    print(f"Listing Dataflow jobs for project '{project_id}' in region '{region}':")
    jobs_found = False
    # list_jobs returns a pager; iterating it yields Job messages across all pages.
    # Note: the field is `type_` (trailing underscore) because proto-plus renames
    # fields that collide with Python builtins.
    for job in client.list_jobs(request=request):
        print(f"  Job ID: {job.id}, Name: {job.name}, Type: {job.type_}, State: {job.current_state}")
        jobs_found = True
    if not jobs_found:
        print("  No Dataflow jobs found.")
except Exception as e:
    print(f"An error occurred: {e}")
    print("Ensure you have authenticated with `gcloud auth application-default login` and enabled the Dataflow API in your project.")