PyStarburst
PyStarburst provides a Python DataFrame API for querying and transforming data directly within Starburst Galaxy and Starburst Enterprise Platform (SEP) clusters. It lets data engineers and developers build complex transformation pipelines and data applications in familiar Python syntax without downloading data locally. The library is actively maintained: version 0.11.0 was released in February 2026, and new releases arrive roughly every few months.
Common errors
-
AttributeError: 'DataFrame' object has no attribute 'my_column'
cause Attempting to access a DataFrame column using dot notation (e.g., `df.my_column`) instead of the `col` function or bracket notation. This is a common mistake when migrating from other DataFrame libraries such as Pandas or PySpark, where dot notation is available for column access.
fix Access DataFrame columns using `df.col("my_column")` or `df["my_column"]`. For example, `df.filter(df.col("my_column") == 'value')`.
-
TypeError: unsupported operand type(s) for &: 'bool' and 'Column'
cause This error often occurs when a native Python boolean (`True`/`False`) is combined with a `Column` expression in a bitwise operation that expects another `Column`, or when `and`/`or` is applied directly to `Column` objects, which Python evaluates by truthiness instead of building an expression.
fix Ensure every part of a boolean expression involving `Column` objects is itself a `Column` object or a literal value that can be implicitly converted. Always use `&` for AND, `|` for OR, and `~` for NOT when combining `Column` expressions, and wrap each comparison in parentheses, because `&` and `|` bind more tightly than `==` and `>` in Python. For example, `df.filter((df.col("age") > 18) & (df.col("city") == "New York"))`.
-
OutOfMemoryError: Java heap space
cause While `PyStarburst` pushes computation to the Starburst cluster, an `OutOfMemoryError` can occur if you call `DataFrame.collect()` on a very large result set, because it attempts to load all of the data into the client's memory.
fix Avoid `collect()` for large datasets. Instead, use `df.show()` for quick inspection, `df.write.save_as_table()` to persist results back into Starburst, or apply further transformations within PyStarburst to aggregate or filter the data before collecting a smaller subset.
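The operator rules behind the `TypeError` entry above can be illustrated without a cluster. The sketch below uses a hypothetical stand-in class, `Expr`, which is not PyStarburst's `Column` class and makes no claim about its internals. It only shows the general Python mechanism: `&`, `|` and `~` can be overloaded to build an expression, while `and`, `or` and `not` always call `bool()` on their operands and cannot be overloaded.

```python
class Expr:
    """Hypothetical stand-in for a column expression (not PyStarburst's Column)."""

    def __init__(self, text):
        self.text = text

    def __and__(self, other):
        # `a & b` builds a new expression instead of evaluating anything.
        return Expr(f"({self.text} AND {other.text})")

    def __or__(self, other):
        return Expr(f"({self.text} OR {other.text})")

    def __invert__(self):
        return Expr(f"(NOT {self.text})")

    def __bool__(self):
        # `and`, `or` and `not` call bool() on their operands; an expression
        # has no truth value until it is evaluated on the cluster.
        raise TypeError("use &, | and ~ instead of and, or and not")


adult = Expr("age > 18")
in_nyc = Expr("city = 'New York'")

# Overloaded operators compose into a single expression.
combined = adult & ~in_nyc
print(combined.text)

# Python's `and` asks for the truth value of `adult`, which fails.
try:
    adult and in_nyc
    failed = False
except TypeError as exc:
    failed = True
    print(f"TypeError: {exc}")
```

The same reasoning explains why each comparison needs its own parentheses in a real filter: Python applies `&` before `==` or `>`, so an unparenthesized expression groups differently from what was intended.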
Warnings
- breaking In PyStarburst 0.8.0, the date and time format used by `to_date` and `to_timestamp` functions changed from Teradata's `yyyy-mm-dd hh24:mi:ss` to JodaTime's more common `yyyy-MM-dd HH:mm:ss` format.
- gotcha Using Python's logical operators (`and`, `or`, `not`) directly on `Column` objects will raise an error. PyStarburst `Column` objects overload the bitwise operators for logical operations, so use `&`, `|`, and `~` instead.
- gotcha The `DataFrame.collect()` method pulls all data from the Starburst cluster into your local Python environment's memory. For large datasets, this can lead to Out-of-Memory (OOM) errors and is not scalable.
- deprecated PyStarburst 0.11.0 dropped support for Python 3.9 as it reached end-of-life.
Install
-
pip install pystarburst
Imports
- Session
from pystarburst import Session
- col
from pystarburst.functions import col
- BasicAuthentication
from trino.auth import BasicAuthentication
Quickstart
import os
from pystarburst import Session
from trino.auth import BasicAuthentication
# Replace with your Starburst cluster details from Partner Connect
host = os.environ.get('STARBURST_HOST', 'your-starburst-host.trino.galaxy.starburst.io')
port = int(os.environ.get('STARBURST_PORT', '443'))
user = os.environ.get('STARBURST_USER', 'your-user@example.com')
password = os.environ.get('STARBURST_PASSWORD', 'your_password')
catalog = os.environ.get('STARBURST_CATALOG', 'sample') # e.g., 'hive', 'iceberg'
schema = os.environ.get('STARBURST_SCHEMA', 'burstbank') # e.g., 'default'
db_parameters = {
    "host": host,
    "port": port,
    "http_scheme": "https",
    "catalog": catalog,
    "schema": schema,
    "auth": BasicAuthentication(user, password)
}
try:
    session = Session.builder.configs(db_parameters).create()
    print("Successfully connected to Starburst!")
    # Example: Querying a table.
    # show() prints the rows rather than returning a DataFrame,
    # so keep the DataFrame in `df` and call show() separately.
    df = session.sql("SELECT * FROM system.runtime.nodes")
    df.show()
    print("Query executed successfully.")
    # Example: Creating a DataFrame and applying a simple transformation
    # df_nation = session.table("nation")  # Assuming 'nation' table exists in 'sample.burstbank'
    # df_filtered = df_nation.filter(df_nation.col("regionkey") == 0)
    # df_filtered.show()
finally:
    # Close the session only if it was created successfully.
    if 'session' in locals() and session:
        session.close()
        print("Session closed.")
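The quickstart falls back to placeholder values when an environment variable is missing, which can turn a configuration mistake into a confusing connection error. As a hedged alternative, the sketch below reads the same variable names but fails fast when a required one is absent. `load_settings` and `REQUIRED` are hypothetical names introduced for illustration; treating host, user and password as required is an assumption, not a PyStarburst rule.

```python
import os

# Assumption: these three values have no sensible default.
REQUIRED = ("STARBURST_HOST", "STARBURST_USER", "STARBURST_PASSWORD")


def load_settings(env=None):
    """Read connection settings from a mapping (defaults to os.environ)."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise RuntimeError("missing environment variables: " + ", ".join(missing))
    return {
        "host": env["STARBURST_HOST"],
        "port": int(env.get("STARBURST_PORT", "443")),
        "user": env["STARBURST_USER"],
        "password": env["STARBURST_PASSWORD"],
        # Optional values keep the quickstart's defaults.
        "catalog": env.get("STARBURST_CATALOG", "sample"),
        "schema": env.get("STARBURST_SCHEMA", "burstbank"),
    }


# Example with an explicit mapping so the sketch runs anywhere.
settings = load_settings({
    "STARBURST_HOST": "example-host",
    "STARBURST_USER": "user@example.com",
    "STARBURST_PASSWORD": "secret",
})
print(settings["host"], settings["port"], settings["catalog"])
```

The `user` and `password` entries would then be passed to `BasicAuthentication`, and the remaining entries copied into `db_parameters` as in the quickstart.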