Databricks Connect Client

18.1.2 · active · verified Thu Apr 09

Databricks Connect is a client library that lets you connect popular IDEs, notebook servers, and custom applications to Databricks clusters. It configures the standard PySpark APIs to run commands remotely, enabling local development and debugging against data on a remote cluster. The current release is 18.1.2; releases track Databricks Runtime (DBR) versions, typically aligning with DBR major/LTS releases and their subsequent patch updates.

Warnings
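The client bundles its own copy of `pyspark`, so it conflicts with a standalone PySpark installation in the same environment. The usual guidance is to remove any standalone copy before installing; a sketch of that cleanup step:

```shell
# databricks-connect bundles pyspark; remove any standalone copy first
# to avoid import conflicts in the same Python environment.
pip uninstall -y pyspark
```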

Install
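The client is distributed on PyPI as `databricks-connect`. A minimal setup sketch, assuming a fresh virtual environment and that the client's version should match the DBR version of the target cluster (18.1.x here is an assumption; adjust to your cluster):

```shell
# Create a clean virtual environment and install a client version
# matching the target cluster's Databricks Runtime.
python -m venv .venv
source .venv/bin/activate
pip install --upgrade "databricks-connect==18.1.*"
```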

Imports
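A typical script needs only the session entry point. On Databricks Connect 13 and later the documented entry point is `DatabricksSession` from `databricks.connect`; earlier clients use plain `SparkSession` from `pyspark.sql`. A small sketch that reports whichever is available (the fallback logic is illustrative, not part of the library):

```python
# Prefer the Databricks Connect 13+ entry point; fall back to the
# plain PySpark builder if databricks.connect is not installed.
try:
    from databricks.connect import DatabricksSession  # Databricks Connect 13+
    entry_point = "databricks.connect.DatabricksSession"
except ImportError:
    try:
        from pyspark.sql import SparkSession  # legacy client / plain PySpark
        entry_point = "pyspark.sql.SparkSession"
    except ImportError:
        entry_point = None  # neither client is installed

print(entry_point)
```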

Quickstart

This quickstart demonstrates how to initialize a SparkSession with Databricks Connect using environment variables for configuration. Ensure your `DATABRICKS_HOST`, `DATABRICKS_TOKEN`, and `DATABRICKS_CLUSTER_ID` are set correctly. The `display()` command requires a Databricks environment; use `show()` for local console output.

import os
from pyspark.sql import SparkSession

# Configure environment variables (replace with your actual values)
# For DBR 13.x and later, DATABRICKS_CLUSTER_ID is required.
# For DBR 12.x and earlier, DATABRICKS_ORG_ID might be required.
# DATABRICKS_PORT is optional, defaults to 15001.

os.environ['DATABRICKS_HOST'] = os.environ.get('DATABRICKS_HOST', 'https://your-databricks-instance.cloud.databricks.com')
os.environ['DATABRICKS_TOKEN'] = os.environ.get('DATABRICKS_TOKEN', 'dapi...')
os.environ['DATABRICKS_CLUSTER_ID'] = os.environ.get('DATABRICKS_CLUSTER_ID', 'your-cluster-id')
# os.environ['DATABRICKS_ORG_ID'] = os.environ.get('DATABRICKS_ORG_ID', 'your-org-id') # Often not needed for modern DBR/configurations
# os.environ['DATABRICKS_PORT'] = os.environ.get('DATABRICKS_PORT', '15001') # Default is 15001

# Initialize a SparkSession using Databricks Connect.
# The builder picks up the DATABRICKS_* environment variables automatically.
# On Databricks Connect 13+ the documented entry point is DatabricksSession
# (from databricks.connect); SparkSession is shown here for the legacy client.
spark = SparkSession.builder.getOrCreate()

# Example: run a simple Spark command on the remote cluster
df = spark.range(10).toDF("id")
df.show()  # local console output
# df.display()  # display() is only available in a Databricks notebook environment

print("Successfully connected to Databricks cluster and ran a Spark command.")

# Clean up environment variables if running multiple tests or configurations
# del os.environ['DATABRICKS_HOST']
# del os.environ['DATABRICKS_TOKEN']
# del os.environ['DATABRICKS_CLUSTER_ID']
# (and others you set)
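Because a missing or misspelled variable typically surfaces later as an opaque connection error, it can help to validate the environment up front. A minimal sketch using only the standard library (the helper name is ours, not part of the client):

```python
import os

# The three variables required by the quickstart above.
REQUIRED_VARS = ("DATABRICKS_HOST", "DATABRICKS_TOKEN", "DATABRICKS_CLUSTER_ID")

def missing_connect_vars(environ=os.environ):
    """Return the required Databricks Connect variables that are unset or empty."""
    return [name for name in REQUIRED_VARS if not environ.get(name)]

# Example: with only the host set, the token and cluster id are reported missing.
print(missing_connect_vars({"DATABRICKS_HOST": "https://example.cloud.databricks.com"}))
# → ['DATABRICKS_TOKEN', 'DATABRICKS_CLUSTER_ID']
```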
