{"id":7608,"library":"pystarburst","title":"PyStarburst","description":"PyStarburst provides a Python DataFrame API for querying and transforming data directly within Starburst Galaxy and Starburst Enterprise Platform (SEP) clusters. It enables data engineers and developers to build complex transformation pipelines and data applications using familiar Python syntax without needing to download data locally. The library is actively maintained: version 0.11.0 was released in February 2026, and new releases appear roughly every few months.","status":"active","version":"0.11.0","language":"en","source_language":"en","source_url":"https://github.com/starburstdata/pystarburst-examples","tags":["dataframes","starburst","querying","etl","trino","data-engineering"],"install":[{"cmd":"pip install pystarburst","lang":"bash","label":"Install PyStarburst"}],"dependencies":[{"reason":"Required for connecting to Trino-based Starburst clusters and handling authentication. Published on PyPI as 'trino' (the source repository is named trino-python-client).","package":"trino","optional":false},{"reason":"Used internally for data validation and configuration management.","package":"pydantic","optional":false},{"reason":"Optional, for exporting PyStarburst DataFrames to Pandas DataFrames for local analysis or file output (e.g., CSV/Parquet).","package":"pandas","optional":true}],"imports":[{"note":"The primary entry point for creating a connection session to Starburst.","symbol":"Session","correct":"from pystarburst import Session"},{"note":"Used to reference DataFrame columns for transformations and selections.","symbol":"col","correct":"from pystarburst.functions import col"},{"note":"Needed for basic username/password authentication when creating a session. 
The 'trino' package is an implicit dependency.","symbol":"BasicAuthentication","correct":"from trino.auth import BasicAuthentication"}],"quickstart":{"code":"import os\nfrom pystarburst import Session\nfrom trino.auth import BasicAuthentication\n\n# Replace with your Starburst cluster details from Partner Connect\nhost = os.environ.get('STARBURST_HOST', 'your-starburst-host.trino.galaxy.starburst.io')\nport = int(os.environ.get('STARBURST_PORT', '443'))\nuser = os.environ.get('STARBURST_USER', 'your-user@example.com')\npassword = os.environ.get('STARBURST_PASSWORD', 'your_password')\ncatalog = os.environ.get('STARBURST_CATALOG', 'sample') # e.g., 'hive', 'iceberg'\nschema = os.environ.get('STARBURST_SCHEMA', 'burstbank') # e.g., 'default'\n\ndb_parameters = {\n    \"host\": host,\n    \"port\": port,\n    \"http_scheme\": \"https\",\n    \"catalog\": catalog,\n    \"schema\": schema,\n    \"auth\": BasicAuthentication(user, password)\n}\n\n# Initialize to None so the finally block can tell whether the session was created\nsession = None\ntry:\n    session = Session.builder.configs(db_parameters).create()\n    print(\"Successfully connected to Starburst!\")\n\n    # Example: Querying a table. show() prints the rows and returns None,\n    # so keep the DataFrame in its own variable.\n    df = session.sql(\"SELECT * FROM system.runtime.nodes\")\n    df.show()\n    print(\"Query executed successfully.\")\n\n    # Example: Creating a DataFrame and applying a simple transformation\n    # df_nation = session.table(\"nation\") # Assuming 'nation' table exists in 'sample.burstbank'\n    # df_filtered = df_nation.filter(df_nation.col(\"regionkey\") == 0)\n    # df_filtered.show()\n\nfinally:\n    # Close the session only if it was actually created\n    if session is not None:\n        session.close()\n        print(\"Session closed.\")","lang":"python","description":"This quickstart demonstrates how to establish a connection to a Starburst cluster using PyStarburst and execute a basic SQL query. It uses environment variables for sensitive connection parameters. 
Replace the placeholder values with your actual Starburst Galaxy or SEP cluster details."},"warnings":[{"fix":"Update `to_date` and `to_timestamp` function calls in your code to use JodaTime's `yyyy-MM-dd HH:mm:ss` format string for compatibility.","message":"In PyStarburst 0.8.0, the date and time format used by the `to_date` and `to_timestamp` functions changed from Teradata's `yyyy-mm-dd hh24:mi:ss` to JodaTime's more common `yyyy-MM-dd HH:mm:ss` format.","severity":"breaking","affected_versions":"0.8.0 and later"},{"fix":"Replace `and` with `&`, `or` with `|`, and `not` with `~` when constructing boolean expressions with `Column` objects. E.g., `(df[\"col1\"] > 1) & (df[\"col2\"] < 10)` instead of `(df[\"col1\"] > 1) and (df[\"col2\"] < 10)`.","message":"Using Python's logical operators (`and`, `or`, `not`) directly on `Column` objects will raise an error. PyStarburst `Column` objects overload bitwise operators for logical operations.","severity":"gotcha","affected_versions":"All versions"},{"fix":"To persist large datasets, use `DataFrame.write.save_as_table()` to leverage Starburst's distributed writing capabilities. To sample or inspect data, use `DataFrame.show()` or `DataFrame.limit().collect()` to retrieve a subset.","message":"The `DataFrame.collect()` method pulls all data from the Starburst cluster into your local Python environment's memory. 
For large datasets, this can lead to Out-of-Memory (OOM) errors and is not scalable.","severity":"gotcha","affected_versions":"All versions"},{"fix":"Ensure your environment uses Python 3.10 or later (up to 3.13) to be compatible with PyStarburst 0.11.0 and future releases.","message":"PyStarburst 0.11.0 dropped support for Python 3.9 because that version reached end-of-life.","severity":"deprecated","affected_versions":"0.11.0 and later"}],"env_vars":null,"last_verified":"2026-04-16T00:00:00.000Z","next_check":"2026-07-15T00:00:00.000Z","problems":[{"fix":"Access DataFrame columns using `df.col(\"my_column\")` or `df[\"my_column\"]`. For example, `df.filter(df.col(\"my_column\") == 'value')`.","cause":"Attempting to access a DataFrame column using dot notation (e.g., `df.my_column`) instead of the `col` method or bracket notation. This is a common mistake when migrating from other DataFrame libraries like Pandas or PySpark where dot notation might be used for column access.","error":"AttributeError: 'DataFrame' object has no attribute 'my_column'"},{"fix":"Ensure all parts of a boolean expression involving `Column` objects are themselves `Column` objects or literal values that can be implicitly converted. Always use `&` for AND, `|` for OR, and `~` for NOT when combining `Column` expressions, and wrap each comparison in parentheses, because `&` and `|` bind more tightly than comparison operators in Python. For example, `df.filter((df.col(\"age\") > 18) & (df.col(\"city\") == \"New York\"))`.","cause":"This error typically occurs when Python's native boolean values (`True`/`False`) are combined with `Column` expressions using bitwise operators, or when `and`/`or` are applied directly to `Column` objects, which Python evaluates by truthiness instead of building a column expression.","error":"TypeError: unsupported operand type(s) for &: 'bool' and 'Column'"},{"fix":"Avoid `collect()` for large datasets. 
Instead, use `df.show()` for quick inspection, `df.write.save_as_table()` to persist results back into Starburst, or apply further transformations within PyStarburst to aggregate or filter the data before collecting a smaller subset.","cause":"While `PyStarburst` pushes computation to the Starburst cluster, an `OutOfMemoryError` can occur if you use `DataFrame.collect()` on a very large result set, attempting to load all data into the client's memory.","error":"OutOfMemoryError: Java heap space"}]}