Databricks SQL Connector for Python
The Databricks SQL Connector for Python is a Python library that enables running SQL commands on Databricks clusters and SQL warehouses. It is a Thrift-based client, conforms to the Python DB API 2.0 specification, and uses Apache Arrow for efficient data exchange. The library is actively maintained with frequent releases.
Warnings
- breaking As of version 4.0.0, PyArrow is no longer a default dependency. Users must explicitly install `databricks-sql-connector[pyarrow]` (or `pip install pyarrow`) to enable Arrow-based features such as CloudFetch and `fetchmany_arrow`; without it, performance on large result sets can degrade.
- breaking The SQLAlchemy dialect for Databricks was split into a separate `databricks-sqlalchemy` package in version 4.0.0. Users leveraging SQLAlchemy must now explicitly install `databricks-sqlalchemy` alongside the core connector.
- gotcha When using `pandas.read_sql` directly with a `databricks-sql-connector` connection object (v3.0.0+), a `UserWarning` regarding DB API 2.0 support might appear. While this warning can generally be ignored due to PyArrow's efficiency, using `databricks-sqlalchemy` as the engine provides a warning-free experience and broader compatibility.
- breaking The `pandas.DataFrame.to_sql()` method effectively broke for inserts binding more than 255 parameter values (not 255 rows) into Delta tables with `databricks-sql-connector` versions 3.0.0 and above, due to the introduction of native parameters and a server-side limit on parameters per statement.
- breaking Version 4.2.0 introduced an `autocommit` property on the connection object. Setting `connection.autocommit = False` via API calls (e.g., in an Airflow hook) can now raise `TransactionError: [CONFIG_NOT_AVAILABLE] Configuration AUTOCOMMIT is not available. SQLSTATE: 42K0I` when the target compute does not support the AUTOCOMMIT configuration. On compute that does support multi-statement transactions, control them with `connection.autocommit = False` followed by `connection.commit()` and `connection.rollback()`.
- gotcha While PyPI metadata states Python >=3.8.0 is supported, the official GitHub README for the latest versions (4.x.x) recommends Python 3.9 or above for full compatibility and access to the latest features and stability.
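Since the PyArrow extra is optional in 4.x, code that wants Arrow-backed fetches can probe for the dependency at import time and fall back to plain row fetches. A minimal sketch, assuming the connector's documented cursor methods `fetchall_arrow()` and `fetchall()`; the `fetch_results` helper itself is hypothetical, not part of the library:

```python
# Sketch: degrade gracefully when the optional pyarrow extra is absent.
try:
    import pyarrow  # present only with databricks-sql-connector[pyarrow]
    HAVE_PYARROW = True
except ImportError:
    HAVE_PYARROW = False

def fetch_results(cursor):
    """Return query results, preferring Arrow when pyarrow is installed."""
    if HAVE_PYARROW:
        return cursor.fetchall_arrow()  # returns a pyarrow.Table
    return cursor.fetchall()           # returns a list of Row objects
```

This keeps a single code path working whether or not the `[pyarrow]` extra was installed.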
Install
- `pip install databricks-sql-connector`
- `pip install databricks-sql-connector[pyarrow]`
Imports
- sql
from databricks import sql
Quickstart
import os
from databricks import sql

# Ensure these environment variables are set:
# DATABRICKS_SERVER_HOSTNAME, DATABRICKS_HTTP_PATH, DATABRICKS_TOKEN
host = os.environ.get('DATABRICKS_SERVER_HOSTNAME')
http_path = os.environ.get('DATABRICKS_HTTP_PATH')
access_token = os.environ.get('DATABRICKS_TOKEN')

if not all([host, http_path, access_token]):
    print("Please set DATABRICKS_SERVER_HOSTNAME, DATABRICKS_HTTP_PATH, and DATABRICKS_TOKEN environment variables.")
else:
    try:
        with sql.connect(
            server_hostname=host,
            http_path=http_path,
            access_token=access_token
        ) as connection:
            with connection.cursor() as cursor:
                cursor.execute("SELECT 1 as id, 'hello' as message")
                for row in cursor.fetchall():
                    print(row)
    except Exception as e:
        print(f"An error occurred: {e}")
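Given the 255-parameter limit on native-parameter inserts noted in the warnings, one workaround is to split rows into batches so that each statement binds at most 255 values. A hedged sketch under that assumption; `chunk_rows` is a hypothetical helper, not part of the connector:

```python
def chunk_rows(rows, n_cols, max_params=255):
    """Yield batches of rows such that len(batch) * n_cols <= max_params.

    rows: list of tuples to insert; n_cols: values bound per row.
    """
    rows_per_batch = max(1, max_params // n_cols)
    for i in range(0, len(rows), rows_per_batch):
        yield rows[i:i + rows_per_batch]

# Usage sketch: execute one INSERT per batch through an open cursor,
# building the statement with the connector's parameter style.
```

Each batch then stays under the server-side cap, avoiding the failure mode described for `pandas.DataFrame.to_sql()` on versions 3.0.0+.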