DataHub Python CLI and SDK

1.5.0.5 · active · verified Thu Apr 09

The `acryl-datahub` package provides a command-line interface (CLI) and a Python SDK for interacting with DataHub, an open-source metadata platform. DataHub acts as a central catalog for the data stack, supporting discovery, governance, and observability across data assets. The current version is 1.5.0.5, and the project is actively maintained, with frequent releases and release candidates.

Warnings

Install

Imports

Quickstart

This quickstart first shows how to start a local DataHub instance with the CLI's `datahub docker quickstart` command. It then gives a Python snippet that connects to a DataHub server with `DatahubRestEmitter` and publishes basic dataset properties.

import os
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.metadata.schema_classes import DatasetPropertiesClass

# --- CLI Quickstart (run in your terminal) ---
# 1. Install Docker and Docker Compose v2.
# 2. Start a local DataHub instance:
#    datahub docker quickstart
#    (This command might take some time to download and start services)
#
# --- Python SDK Example (after DataHub is running) ---
# For local quickstart, GMS server is typically http://localhost:8080
gms_server = os.environ.get("DATAHUB_GMS_SERVER", "http://localhost:8080")
# Optional access token for cloud/secured instances; an unset or empty value becomes None
token = os.environ.get("DATAHUB_GMS_TOKEN") or None

# Initialize the REST emitter
# Note: 'token' can be passed directly; it does not need to go through extra_headers.
emitter = DatahubRestEmitter(gms_server=gms_server, token=token)

# Define a sample dataset URN
dataset_urn = "urn:li:dataset:(urn:li:dataPlatform:hive,sample_dataset,PROD)"

# Create a DatasetProperties aspect
dataset_properties = DatasetPropertiesClass(
    description="This is a sample dataset emitted via the Python SDK quickstart.",
    customProperties={
        "owner_team": "data_platform",
        "environment": "production_dev"
    }
)

# Create a MetadataChangeProposalWrapper
mcp = MetadataChangeProposalWrapper(
    entityUrn=dataset_urn,
    aspect=dataset_properties,
)

# Emit the metadata change proposal
try:
    emitter.emit(mcp)
    print(f"Successfully emitted properties for dataset: {dataset_urn}")
except Exception as e:
    print(f"Failed to emit metadata: {e}")
    print("Ensure your DataHub instance is running and accessible at", gms_server)
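The dataset URN in the snippet above packs three parts into one string: the platform, the dataset name, and the environment. The sketch below shows that layout with two hypothetical helpers, `build_dataset_urn` and `split_dataset_urn`, which are illustrative only and not part of the SDK. The SDK has its own builder functions (for example `make_dataset_urn` in `datahub.emitter.mce_builder`), which are likely the better choice in real code. The sketch depends only on the standard library and assumes the platform and environment contain no commas.

```python
from typing import Tuple

DATASET_PREFIX = "urn:li:dataset:("
PLATFORM_PREFIX = "urn:li:dataPlatform:"


def build_dataset_urn(platform: str, name: str, env: str = "PROD") -> str:
    """Hypothetical helper: format a dataset URN from its three parts."""
    return f"{DATASET_PREFIX}{PLATFORM_PREFIX}{platform},{name},{env})"


def split_dataset_urn(urn: str) -> Tuple[str, str, str]:
    """Hypothetical helper: recover (platform, name, env) from a dataset URN."""
    if not (urn.startswith(DATASET_PREFIX) and urn.endswith(")")):
        raise ValueError(f"Not a dataset URN: {urn!r}")
    body = urn[len(DATASET_PREFIX):-1]
    # The platform URN comes first and the environment last; the dataset
    # name is whatever sits between them.
    platform_urn, rest = body.split(",", 1)
    name, env = rest.rsplit(",", 1)
    if not platform_urn.startswith(PLATFORM_PREFIX):
        raise ValueError(f"Unexpected platform segment: {platform_urn!r}")
    return platform_urn[len(PLATFORM_PREFIX):], name, env


if __name__ == "__main__":
    urn = build_dataset_urn("hive", "sample_dataset")
    print(urn)
    print(split_dataset_urn(urn))
```

Building the URN from named parts keeps the platform and environment explicit, which makes a mismatch between the URN's environment and a custom property such as `environment` easier to spot.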
