DataHub Python CLI and SDK
The `acryl-datahub` package provides a command-line interface (CLI) and a Python SDK for interacting with DataHub, an open-source metadata platform. DataHub acts as a central nervous system for your data stack, enabling discovery, governance, and observability across your data assets. The library is currently at version 1.5.0.5 and maintains an active release cadence, with frequent updates and release candidates.
Warnings
- breaking: Python 3.9 support has been officially dropped. All `acryl-datahub` packages now require Python 3.10 or later.
- breaking: The V1 UI theme is officially sunset as of v1.5.0; all development targets the V2 UI going forward. If you're self-hosting, ensure the GMS environment variables `THEME_V2_ENABLED` and `THEME_V2_DEFAULT` are set to `true`.
- breaking: The `acryl-datahub` package now requires Pydantic v2; support for Pydantic v1 has been dropped.
- breaking: SQL view query IDs are now SHA-256 hashes instead of URL-encoded view URNs, so old query entities used for view lineage tracking will become orphaned.
- gotcha: As of DataHub CLI 1.5, handling of the token signing key for Metadata Service Authentication has changed. If the key is not set explicitly via environment variables, new random values are generated and stored locally in `~/.datahub/quickstart/.local-secrets.env`.
- gotcha: `DatahubRestEmitter.emit()` (and `emit_mcp()`) now returns `Optional[TraceData]` instead of `None` or an `int`, exposing trace IDs for SYNC_PRIMARY and ASYNC modes.
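The SHA-256 query-ID change can be illustrated with a stdlib-only sketch. Note this is not DataHub's internal implementation; the exact ID scheme is an assumption here, shown only to contrast hashing with the old URL-encoding approach:

```python
import hashlib
import urllib.parse

view_urn = "urn:li:dataset:(urn:li:dataPlatform:hive,my_view,PROD)"

# Old style (illustrative): query ID derived by URL-encoding the view URN.
old_query_id = urllib.parse.quote(view_urn, safe="")

# New style (illustrative): a stable SHA-256 hex digest of the view URN.
new_query_id = hashlib.sha256(view_urn.encode("utf-8")).hexdigest()

print(old_query_id)  # percent-encoded URN, e.g. starts with "urn%3Ali%3A..."
print(new_query_id)  # 64 lowercase hex characters
```

Because the two schemes produce entirely different identifiers for the same view, query entities keyed by the old encoding are no longer matched, which is why they become orphaned.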
Install
pip install acryl-datahub
pip install 'acryl-datahub[datahub-rest]'  # For programmatic interaction over REST
Imports
- DatahubRestEmitter
from datahub.emitter.rest_emitter import DatahubRestEmitter
- MetadataChangeProposalWrapper
from datahub.emitter.mcp import MetadataChangeProposalWrapper
- DatasetPropertiesClass
from datahub.metadata.schema_classes import DatasetPropertiesClass
- DatahubClientConfig
from datahub.ingestion.graph.client import DatahubClientConfig, DataHubGraph
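The `DatahubClientConfig`/`DataHubGraph` pair imported above can also read metadata back out of DataHub. A minimal sketch, assuming a reachable GMS at `server`; the imports are deferred into the function so the snippet stays import-safe when no instance is running:

```python
from typing import Optional

def fetch_dataset_description(
    server: str, dataset_urn: str, token: Optional[str] = None
) -> Optional[str]:
    """Read the description from a dataset's DatasetProperties aspect, if any."""
    # Deferred imports: only needed when the function is actually called.
    from datahub.ingestion.graph.client import DatahubClientConfig, DataHubGraph
    from datahub.metadata.schema_classes import DatasetPropertiesClass

    graph = DataHubGraph(DatahubClientConfig(server=server, token=token))
    props = graph.get_aspect(entity_urn=dataset_urn, aspect_type=DatasetPropertiesClass)
    return props.description if props else None
```

`get_aspect` returns `None` when the entity has no such aspect, so callers should handle the missing case rather than assume every dataset has properties.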
Quickstart
import os
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.metadata.schema_classes import DatasetPropertiesClass
# --- CLI Quickstart (run in your terminal) ---
# 1. Install Docker and Docker Compose v2.
# 2. Start a local DataHub instance:
# datahub docker quickstart
# (This command might take some time to download and start services)
#
# --- Python SDK Example (after DataHub is running) ---
# For local quickstart, GMS server is typically http://localhost:8080
gms_server = os.environ.get("DATAHUB_GMS_SERVER", "http://localhost:8080")
token = os.environ.get("DATAHUB_GMS_TOKEN", "") # For cloud/secured instances, provide a token
# Initialize the REST emitter.
# The token can be passed directly via the 'token' parameter; it does not
# need to be smuggled in through extra_headers.
emitter = DatahubRestEmitter(gms_server=gms_server, token=token)
# Define a sample dataset URN
dataset_urn = "urn:li:dataset:(urn:li:dataPlatform:hive,sample_dataset,PROD)"
# Create a DatasetProperties aspect
dataset_properties = DatasetPropertiesClass(
    description="This is a sample dataset emitted via the Python SDK quickstart.",
    customProperties={
        "owner_team": "data_platform",
        "environment": "production_dev",
    },
)
# Create a MetadataChangeProposalWrapper
mcp = MetadataChangeProposalWrapper(
    entityUrn=dataset_urn,
    aspect=dataset_properties,
)
# Emit the metadata change proposal
try:
    emitter.emit(mcp)
    print(f"Successfully emitted properties for dataset: {dataset_urn}")
except Exception as e:
    print(f"Failed to emit metadata: {e}")
    print("Ensure your DataHub instance is running and accessible at", gms_server)
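The dataset URN used in the quickstart follows a fixed grammar: `urn:li:dataset:(urn:li:dataPlatform:<platform>,<name>,<env>)`. The SDK ships helpers for this (e.g. `make_dataset_urn` in `datahub.emitter.mce_builder`); a stdlib-only sketch of the same pattern, for illustration:

```python
def make_dataset_urn(platform: str, name: str, env: str = "PROD") -> str:
    """Build a dataset URN from its platform, name, and environment (fabric)."""
    return f"urn:li:dataset:(urn:li:dataPlatform:{platform},{name},{env})"

print(make_dataset_urn("hive", "sample_dataset"))
# → urn:li:dataset:(urn:li:dataPlatform:hive,sample_dataset,PROD)
```

Prefer the SDK helper in real code: it validates inputs and stays in sync with DataHub's URN conventions as they evolve.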