{"id":9873,"library":"kubeflow","title":"Kubeflow Python SDK","description":"The Kubeflow Python SDK (current version 0.4.0) provides a client library to programmatically manage machine learning workloads and interact with various Kubeflow APIs. It allows users to define, create, monitor, and delete training jobs (e.g., PyTorchJob, TFJob), hyperparameter optimization jobs (Katib), and other ML-related resources directly from Python. Releases are frequent, typically focusing on new features and bug fixes across minor versions.","status":"active","version":"0.4.0","language":"en","source_language":"en","source_url":"https://github.com/kubeflow/sdk","tags":["machine learning","kubernetes","mlops","orchestration","pytorch","tensorflow","katib"],"install":[{"cmd":"pip install kubeflow","lang":"bash","label":"Install latest version"}],"dependencies":[],"imports":[{"symbol":"TrainerClient","correct":"from kubeflow.sdk.training import TrainerClient"},{"symbol":"TrainingJob","correct":"from kubeflow.sdk.training.api import TrainingJob"},{"symbol":"V1PyTorchJob","correct":"from kubeflow.sdk.training.models import V1PyTorchJob"},{"note":"OptimizerClient is for Katib (HPO) and is in a separate submodule from TrainerClient.","wrong":"from kubeflow.sdk.training import OptimizerClient","symbol":"OptimizerClient","correct":"from kubeflow.sdk.optimizer import OptimizerClient"}],"quickstart":{"code":"import os\nfrom kubeflow.sdk.training import TrainerClient\nfrom kubeflow.sdk.training.api import TrainingJob\nfrom kubeflow.sdk.training.models import V1PyTorchJob, V1RunPolicy\n\n# NOTE: This example requires a running Kubeflow cluster and configured kubectl context.\n# It will create a PyTorch training job in the 'kubeflow' namespace.\n# Define your training job\ntraining_job = TrainingJob(\n    api_version=\"kubeflow.org/v1\",\n    kind=\"PyTorchJob\",\n    metadata={\n        \"name\": os.environ.get('KF_JOB_NAME', 'my-pytorch-job'), \n        \"namespace\": 
os.environ.get('KF_NAMESPACE', 'kubeflow')\n    },\n    spec=V1PyTorchJob(\n        pytorch_replica_specs={\n            \"Worker\": {\n                \"replicas\": 1,\n                \"restartPolicy\": \"OnFailure\",\n                \"template\": {\n                    \"spec\": {\n                        \"containers\": [\n                            {\n                                \"name\": \"pytorch\",\n                                \"image\": \"pytorch/pytorch:1.9.0-cuda11.1-cudnn8-runtime\",\n                                \"command\": [\"python\", \"-c\", \"print('Hello Kubeflow!')\"],\n                            }\n                        ]\n                    }\n                },\n            }\n        },\n        run_policy=V1RunPolicy(clean_pod_policy=\"All\"),\n    ),\n)\n\n# Initialize the TrainerClient\ntry:\n    trainer_client = TrainerClient()\n\n    # Create the training job on the Kubeflow cluster\n    created_job = trainer_client.create_job(job=training_job)\n    print(f\"Job '{created_job.metadata.name}' created in namespace '{created_job.metadata.namespace}'.\")\n\n    # Wait for job completion (optional, can block)\n    # trainer_client.wait_for_job_completion(name=created_job.metadata.name, namespace=created_job.metadata.namespace)\n    # print(f\"Job '{created_job.metadata.name}' completed.\")\n\n    # Get job status (optional)\n    status = trainer_client.get_job_status(name=created_job.metadata.name, namespace=created_job.metadata.namespace)\n    print(f\"Job status: {status.state}\")\n\n    # Delete the job (optional, uncomment to enable)\n    # trainer_client.delete_job(name=created_job.metadata.name, namespace=created_job.metadata.namespace)\n    # print(f\"Job '{created_job.metadata.name}' deleted.\")\n\nexcept Exception as e:\n    print(f\"An error occurred: {e}\")\n    print(\"Ensure your kubectl context is correctly configured and pointing to a Kubeflow cluster.\")","lang":"python","description":"This quickstart 
demonstrates how to create a simple PyTorch training job using the Kubeflow SDK's `TrainerClient`. It defines a `PyTorchJob` spec and submits it to a Kubeflow cluster. Running it requires an active Kubeflow deployment and a `kubectl` context (kubeconfig) that points at that cluster."},"warnings":[{"fix":"Migrate your training job definitions to use `RuntimePatches` instead of `PodTemplateOverrides` for advanced pod customization.","message":"The `PodTemplateOverrides` API for custom pod modifications in training jobs has been replaced by `RuntimePatches` starting from Kubeflow SDK v0.4.0.","severity":"breaking","affected_versions":"0.4.0 and later"},{"fix":"Ensure `kubectl` is installed and configured to connect to your Kubeflow cluster. Verify connectivity with `kubectl get pods -n kubeflow`.","message":"The Kubeflow SDK client talks to a remote Kubeflow cluster, so your Python environment needs a kubeconfig (e.g., `~/.kube/config`) whose current context points to a running Kubeflow instance.","severity":"gotcha","affected_versions":"All versions"},{"fix":"Check the official Kubeflow documentation for recommended SDK versions compatible with your cluster's Kubeflow deployment. Upgrade or downgrade the SDK or your cluster components as needed.","message":"The Kubeflow SDK version and the Kubeflow version deployed on your cluster can be incompatible, especially in the CRD (Custom Resource Definition) versions used by the training operators (e.g., PyTorchJob, TFJob).","severity":"gotcha","affected_versions":"All versions"},{"fix":"Use `from kubeflow.sdk.training import TrainerClient` for training jobs and `from kubeflow.sdk.optimizer import OptimizerClient` for Katib (HPO) jobs.","message":"The `TrainerClient` and `OptimizerClient` are distinct and manage different components of Kubeflow. 
Import and use the correct client for each task: `TrainerClient` for training jobs, `OptimizerClient` for hyperparameter optimization jobs.","severity":"gotcha","affected_versions":"All versions"}],"env_vars":null,"last_verified":"2026-04-17T00:00:00.000Z","next_check":"2026-07-16T00:00:00.000Z","problems":[{"fix":"Install the package with `pip install kubeflow`, then check that your imports follow the pattern `from kubeflow.sdk.<submodule> import <Symbol>`.","cause":"The `kubeflow` package is not installed in the active Python environment, or the import path is incorrect.","error":"ModuleNotFoundError: No module named 'kubeflow.sdk'"},{"fix":"Verify the namespace and job name. Check your `kubectl` context with `kubectl config current-context` and ensure it can access the Kubeflow cluster. Also verify that the required CRDs (e.g., `pytorchjobs.kubeflow.org`) exist on the cluster, for example with `kubectl get crd`.","cause":"The Kubernetes API server could not find the specified resource (e.g., job, namespace), or the `kubectl` context is incorrectly configured.","error":"ApiException: (404) Reason: Not Found"},{"fix":"Check that the returned object is not `None` before accessing `.metadata`. Inspect the `training_job` definition for correctness, and check the logs of the relevant operator (e.g., `pytorch-operator`) for more specific errors.","cause":"This usually happens when `create_job` or `get_job` returns `None` because job creation failed or the job could not be retrieved, often due to an underlying API error or a malformed job spec.","error":"AttributeError: 'NoneType' object has no attribute 'metadata'"}]}