Databricks Connect Client
Databricks Connect lets you connect popular IDEs, notebook servers, and custom applications to Databricks clusters. It is a client library that routes standard PySpark API calls to a remote Databricks cluster, enabling local development and debugging against data on that cluster. The current version is 18.1.2; releases track Databricks Runtime (DBR) versions, typically aligning with DBR major/LTS releases and subsequent patch updates.
Warnings
- breaking Databricks Connect client versions are strictly tied to specific Python versions. For instance, Databricks Connect 18.x requires Python 3.12. Using an incompatible Python version will result in installation failures or runtime errors.
- breaking The Databricks Connect client version must match the Databricks Runtime (DBR) version of the cluster you are connecting to. Mismatched versions can cause connection errors or subtle behavioral differences.
- gotcha Installing `databricks-connect` with an existing `pyspark` installation can lead to dependency conflicts or unexpected `pyspark` version mismatches. `databricks-connect` bundles its own compatible `pyspark`.
- gotcha Incorrect configuration of connection parameters (host, token, cluster ID, sometimes org ID or port) is a very common reason for connection failures.
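The Python-version constraint above can be checked before anything else runs. A minimal sketch; the `REQUIRED_PYTHON` table and `compatible_python` helper are illustrative (only the 18.x → 3.12 pairing comes from the warning above):

```python
import sys

# Python version required per Databricks Connect major release
# (illustrative; 18.x -> 3.12 per the warning above).
REQUIRED_PYTHON = {18: (3, 12)}

def compatible_python(client_major, version_info=None):
    """Return True if the interpreter matches the Python version
    required by the given Databricks Connect major release."""
    version_info = version_info or sys.version_info
    required = REQUIRED_PYTHON.get(client_major)
    return required is not None and tuple(version_info[:2]) == required

print(compatible_python(18, (3, 12, 0)))  # True
print(compatible_python(18, (3, 11, 9)))  # False
```

Failing fast on the interpreter version avoids the confusing install- or runtime-stage errors the warning describes.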
Install
- pip
pip install databricks-connect==18.1.2
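To avoid the `pyspark` clash called out in the warnings, install into a fresh virtual environment. A sketch; the `python3.12` binary name assumes the 18.x/Python 3.12 pairing stated above:

```shell
# Isolate databricks-connect so its bundled pyspark cannot conflict
# with a pre-existing standalone pyspark installation.
python3.12 -m venv .venv
. .venv/bin/activate
pip uninstall -y pyspark                 # remove any standalone pyspark first
pip install databricks-connect==18.1.2
python -c "from databricks.connect import DatabricksSession; print('ok')"
```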
Imports
- DatabricksSession (entry point for DBR 13 and later)
from databricks.connect import DatabricksSession
- SparkSession (legacy entry point for DBR 12.x and earlier)
from pyspark.sql import SparkSession
Quickstart
import os
from databricks.connect import DatabricksSession

# Configure connection via environment variables (replace with your actual values).
# For DBR 13.x and later, DATABRICKS_CLUSTER_ID is required.
# The legacy client (DBR 12.x and earlier) may also need DATABRICKS_ORG_ID,
# plus DATABRICKS_PORT (optional, default 15001).
os.environ.setdefault('DATABRICKS_HOST', 'https://your-databricks-instance.cloud.databricks.com')
os.environ.setdefault('DATABRICKS_TOKEN', 'dapi...')
os.environ.setdefault('DATABRICKS_CLUSTER_ID', 'your-cluster-id')

# DatabricksSession.builder.getOrCreate() picks up the DATABRICKS_* environment
# variables (or a ~/.databrickscfg profile) automatically.
spark = DatabricksSession.builder.getOrCreate()

# Example: run a simple Spark command on the remote cluster
df = spark.range(10).toDF("id")
df.show()  # prints the first rows to the local console

print("Successfully connected to Databricks cluster and ran a Spark command.")

# Clean up environment variables if running multiple tests or configurations:
# del os.environ['DATABRICKS_HOST']
# del os.environ['DATABRICKS_TOKEN']
# del os.environ['DATABRICKS_CLUSTER_ID']
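The cleanup step at the end of the quickstart can be wrapped in a small context manager that restores prior values automatically; `databricks_env` is our own illustrative helper, not part of the library:

```python
import os
from contextlib import contextmanager

@contextmanager
def databricks_env(**overrides):
    """Temporarily set DATABRICKS_* variables, restoring the previous
    values on exit -- handy when testing multiple cluster configs."""
    saved = {}
    try:
        for key, value in overrides.items():
            saved[key] = os.environ.get(key)
            os.environ[key] = value
        yield
    finally:
        for key, old in saved.items():
            if old is None:
                os.environ.pop(key, None)
            else:
                os.environ[key] = old

with databricks_env(DATABRICKS_CLUSTER_ID="cluster-a"):
    print(os.environ["DATABRICKS_CLUSTER_ID"])  # cluster-a
```

This avoids leaking credentials or cluster IDs between test configurations in the same process.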