Substrait Python Bindings
The `substrait` Python package provides Python bindings for the Substrait specification, a cross-language intermediate representation for data compute operations. It lets users construct, manipulate, serialize, and deserialize Substrait plans in Python. The package is explicitly described as experimental and under active development, and it is not an execution engine. The current version is 0.29.0, released in March 2026. The core Substrait specification is released frequently, and its releases can include breaking changes.
Warnings
- gotcha: The `substrait` Python package is explicitly marked as 'experimental' and 'still under development'. Its API and behavior may change frequently, and minor releases do not strictly follow semantic versioning, so upgrades can cause unexpected breakages.
- gotcha: This library is *not* an execution engine for Substrait plans. Its purpose is to provide a Python interface for *producing* and *consuming* Substrait plans, which are then executed by external Substrait-compliant data compute engines (e.g., DataFusion, DuckDB).
- breaking: The underlying Substrait specification itself undergoes breaking changes, and the Python bindings are tightly coupled to it. For example, a significant URI-to-URN migration occurred in 2025 across the Substrait ecosystem. Such changes in the spec lead to corresponding breaking changes in the Python library.
- gotcha: Compatibility with other Substrait tools (producers, consumers, validators) can be complex because the specification keeps evolving. A given version of a consumer or validator may support only a specific range of spec versions, and the Python library, being coupled to the spec, is subject to the same constraint.
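Because of the warnings above, it can help to pin the package to a range you have tested and to check that range at startup. The following is a minimal sketch using only the standard library. The bounds `MIN_VERSION` and `MAX_VERSION` and the helper names are hypothetical choices for illustration, not part of the `substrait` package.

```python
import re
from importlib import metadata

# Hypothetical bounds: use the range your own project has actually tested.
MIN_VERSION = (0, 29, 0)  # inclusive
MAX_VERSION = (0, 30, 0)  # exclusive


def parse_version(text):
    """Return the leading numeric part of a version string as a 3+ tuple of ints."""
    match = re.match(r"\d+(?:\.\d+)*", text)
    if match is None:
        raise ValueError(f"not a version string: {text!r}")
    parts = [int(part) for part in match.group().split(".")]
    parts += [0] * (3 - len(parts))  # pad "0.29" to (0, 29, 0)
    return tuple(parts)


def is_supported(text, minimum=MIN_VERSION, maximum=MAX_VERSION):
    """True if minimum <= version < maximum, comparing numeric tuples."""
    return minimum <= parse_version(text) < maximum


def check_installed(package="substrait"):
    """Return (version, supported), or (None, False) if the package is absent."""
    try:
        installed = metadata.version(package)
    except metadata.PackageNotFoundError:
        return None, False
    return installed, is_supported(installed)


installed, supported = check_installed()
if installed is None:
    print("substrait is not installed")
elif not supported:
    print(f"substrait {installed} is outside the tested range")
else:
    print(f"substrait {installed} is within the tested range")
```

The same range can be expressed at install time with a standard pip requirement specifier such as `substrait>=0.29,<0.30`. Note that the sketch ignores pre-release suffixes (for example, `0.30.0rc1` is treated as `0.30.0`).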
Install
- pip install substrait
- conda install -c conda-forge python-substrait
Imports
- proto: `from substrait import proto`
Quickstart
from substrait import proto

# Example: build a Substrait plan equivalent to
#   SELECT first_name, surname FROM people
# The root names must line up with the columns the ReadRel emits, so both
# columns are listed. Returning only first_name would need a projection.
plan = proto.Plan(
    relations=[
        proto.PlanRel(
            root=proto.RelRoot(
                names=["first_name", "surname"],
                input=proto.Rel(
                    read=proto.ReadRel(
                        named_table=proto.ReadRel.NamedTable(names=["people"]),
                        base_schema=proto.NamedStruct(
                            names=["first_name", "surname"],
                            struct=proto.Type.Struct(
                                types=[
                                    proto.Type(string=proto.Type.String(nullability=proto.Type.Nullability.NULLABILITY_REQUIRED)),
                                    proto.Type(string=proto.Type.String(nullability=proto.Type.Nullability.NULLABILITY_REQUIRED)),
                                ]
                            ),
                        ),
                    )
                ),
            )
        )
    ]
)

# Print the plan in protobuf text format.
print(plan)

# Serialize the plan to the protobuf binary wire format.
serialized_plan = plan.SerializeToString()
print(f"Serialized plan length: {len(serialized_plan)} bytes")

# To consume a plan from bytes:
# loaded_plan = proto.Plan()
# loaded_plan.ParseFromString(serialized_plan)
# print(loaded_plan)

# To load a plan from JSON (assuming you have a JSON string 'json_plan_str'):
# from google.protobuf import json_format
# json_plan_str = "{... your JSON plan ...}"
# loaded_plan_from_json = json_format.Parse(json_plan_str, proto.Plan())
# print(loaded_plan_from_json)
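Since plans are meant to be handed to an external engine, a common next step is to write the serialized bytes to a file. The sketch below shows one way to do that. The helper names `save_plan` and `load_plan` are hypothetical and not part of the `substrait` package. They rely only on the standard protobuf message methods `SerializeToString` and `ParseFromString` already used in the Quickstart.

```python
from pathlib import Path


def save_plan(plan, path):
    """Write a protobuf message (such as a Substrait Plan) to `path` in binary form.

    Returns the number of bytes written.
    """
    return Path(path).write_bytes(plan.SerializeToString())


def load_plan(plan_cls, path):
    """Read binary protobuf data from `path` into a fresh instance of `plan_cls`."""
    plan = plan_cls()
    plan.ParseFromString(Path(path).read_bytes())
    return plan
```

With the Quickstart's objects, `save_plan(plan, "plan.bin")` followed by `load_plan(proto.Plan, "plan.bin")` should give back a message equal to `plan`. How an engine then reads that file depends on the engine and is not covered here.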