TorchX SDK and Components
TorchX is a Python SDK for MLOps that helps you compose, configure, and launch PyTorch applications on various schedulers like local, Docker, Kubernetes, and Ray. It provides a common API for distributed training, serving, and other ML workloads. The current version is 0.7.0, with major releases occurring every few months.
Common errors
-
ModuleNotFoundError: No module named 'torchx.schedulers.ddp_scheduler'
cause The `ddp_scheduler` module was removed in TorchX v0.7.0.
fix Update your code to use the Ray scheduler (`torchx.schedulers.ray_scheduler`, selected by the scheduler name `ray`) for DDP workloads, or use a general scheduler such as Kubernetes with appropriate configuration. For example, `runner.run(app, scheduler='ray')`.
-
Error: No such command 'run'
cause The `torchx run` CLI command was renamed to `torchx launch` in v0.6.0.
fix Replace `torchx run` with `torchx launch` in your command-line invocations.
-
TypeError: Role() got an unexpected keyword argument 'scheduler'
cause The `scheduler` argument was removed from `torchx.specs.Role` in v0.5.0.
fix Remove the `scheduler` argument from your `specs.Role` constructor. Specify the scheduler when launching instead, via `runner.run(app, scheduler='...')`.
-
RuntimeError: Docker client is not available or daemon is not running. Please ensure docker is installed and running.
cause Attempting to use a Docker-based scheduler (e.g., `local_docker`) without a running Docker daemon.
fix Start the Docker daemon on your machine. On Linux, ensure the Docker service is running (`sudo systemctl start docker`). With Docker Desktop, ensure the application is open and running.
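The Docker daemon error above can be caught before launching. The following is a minimal sketch using only the Python standard library; the helper name `docker_daemon_available` is hypothetical and not part of TorchX. It assumes that `docker info` exits with a non-zero status when the daemon cannot be reached.

```python
import shutil
import subprocess


def docker_daemon_available(timeout_s: float = 5.0) -> bool:
    """Return True if the `docker` CLI is on PATH and `docker info` succeeds.

    Hypothetical helper for illustration; not part of TorchX.
    """
    if shutil.which("docker") is None:
        return False  # Docker CLI is not installed or not on PATH
    try:
        result = subprocess.run(
            ["docker", "info"],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,
        )
    except (subprocess.TimeoutExpired, OSError):
        return False  # CLI hung or could not be executed
    return result.returncode == 0


# Fall back to the process-based local scheduler when Docker is unavailable.
scheduler = "local_docker" if docker_daemon_available() else "local_cwd"
print(f"Selected scheduler: {scheduler}")
```

Falling back to `local_cwd` is one possible design choice. If the job depends on a container image, failing fast with a clear message may be the better option.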
Warnings
- breaking The `torchx.schedulers.ddp_scheduler` module was removed and replaced with the `ray_scheduler` for DDP-like workloads.
- breaking The CLI command `torchx run` was renamed to `torchx launch`.
- breaking The `scheduler` field was removed from `specs.Role`. Scheduler selection now happens at the `runner.run()` level.
- gotcha Using Docker-based schedulers (e.g., `local_docker`, `kubernetes` with container images) requires a running Docker daemon.
- gotcha TorchX components like `dist.ddp` are factory functions that return `AppDef` objects, not direct applications or classes.
Install
-
pip install torchx
Imports
- get_runner
from torchx.runner import get_runner
- AppDef
from torchx.specs import AppDef
- Role
from torchx.specs import Role
- Resource
from torchx.specs import Resource
- ddp
from torchx.components.dist import ddp
Quickstart
from torchx import specs
from torchx.runner import get_runner

# Define a simple application
app = specs.AppDef(
    name='hello-world',
    roles=[
        specs.Role(
            name='worker',
            image='',  # required field; local_cwd ignores it and runs from the current directory
            entrypoint='echo',
            args=['Hello, TorchX!'],
            num_replicas=1,
            resource=specs.Resource(cpu=1, gpu=0, memMB=512),  # cpu, gpu, and memMB are all required
        )
    ]
)

# Initialize the TorchX runner
runner = get_runner()

# Launch the application on the local_cwd scheduler, which runs each role
# as a local process in the current working directory (no Docker needed)
app_handle = runner.run(app, scheduler='local_cwd')
print(f"Application '{app.name}' launched with handle: {app_handle}")

# You can optionally wait for the application to complete
# status = runner.wait(app_handle)  # Blocks until the app reaches a terminal state
# print(f"Application '{app.name}' completed with status: {status}")