Apache Sedona
Apache Sedona™ is a cluster computing system for processing large-scale spatial data. It extends engines such as Apache Spark and Apache Flink (and the Snowflake data warehouse) with SpatialRDDs, Spatial SQL, and Spatial DataFrames, enabling developers to efficiently load, process, and analyze large-scale spatial data across machines. The current stable version is 1.8.1, and the project maintains an active release cadence with multiple major and minor updates throughout the year.
Warnings
- breaking Apache Sedona 1.8.0 and later versions dropped support for Java 8 and Apache Spark 3.3. Users must upgrade to Java 11+ and Apache Spark 3.4+ to use these versions.
- gotcha When using `apache-sedona` with Apache Spark, a `sedona-spark-shaded` (or `sedona-spark`) JAR file, compatible with your Spark and Scala versions, is required. This JAR must be either placed in `SPARK_HOME/jars/` or specified via Spark configuration (e.g., `spark.jars.packages`). Failing to include the correct JAR can lead to `NoClassDefFoundError` or `NoSuchMethodError` for spatial functions.
- deprecated Since Apache Sedona 1.5.0, the separate `sedona-python-adapter` JAR is no longer released, as its functionality was merged into the main `sedona-spark` JAR. Using or including older `sedona-python-adapter` JARs with newer Sedona versions can lead to dependency conflicts and runtime errors.
- gotcha In Apache Sedona versions 1.0.1 and earlier, the `pyspark` dependency in `setup.py` was mistakenly configured to be `< v3.1.0`. This could cause `pip` to automatically uninstall a newer `pyspark` version (e.g., 3.1.1) and install an older one (e.g., 3.0.2) upon `apache-sedona` installation, leading to version conflicts.
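The JAR requirement above can be satisfied at launch time by passing the shaded artifact through `--packages` instead of copying files into `SPARK_HOME/jars/`. The Maven coordinates below are illustrative only (they assume Spark 3.5, Scala 2.12, and Sedona 1.8.1); check Maven Central for the artifact names matching your own Spark/Scala/Sedona versions.

```shell
# Sketch: launch PySpark with Sedona pulled from Maven Central.
# Assumed versions: Spark 3.5 / Scala 2.12 / Sedona 1.8.1 -- adjust to yours.
pyspark --packages org.apache.sedona:sedona-spark-shaded-3.5_2.12:1.8.1,org.datasyslab:geotools-wrapper:1.8.1-28.5
```

The `geotools-wrapper` artifact supplies the GeoTools classes that some spatial functions (e.g., CRS transforms) need at runtime; its version string pairs a Sedona version with a GeoTools version, so keep it in sync with the Sedona artifact.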
Install
- pip install apache-sedona
- pip install "apache-sedona[spark]"
- pip install "apache-sedona[db]"
Imports
- SedonaContext
from sedona.spark import SedonaContext
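With the Spark extra installed, a session is typically created via `SedonaContext`; this config sketch assumes the same illustrative Maven coordinates as above (Spark 3.5 / Scala 2.12 / Sedona 1.8.1), which you should replace with versions matching your cluster.

```python
from sedona.spark import SedonaContext

# Configuration sketch: build a Spark session with the Sedona JARs
# fetched from Maven Central (versions are assumptions -- adjust to yours).
config = (
    SedonaContext.builder()
    .config(
        "spark.jars.packages",
        "org.apache.sedona:sedona-spark-shaded-3.5_2.12:1.8.1,"
        "org.datasyslab:geotools-wrapper:1.8.1-28.5",
    )
    .getOrCreate()
)

# Registers Sedona's spatial SQL functions (ST_*) on the session.
sedona = SedonaContext.create(config)
```

After `SedonaContext.create`, the returned session accepts Spatial SQL directly, e.g. `sedona.sql("SELECT ST_Point(1.0, 2.0)")`.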
- sedona.db
import sedona.db
Quickstart
import sedona.db
from shapely.geometry import Point
# 1. Connect to SedonaDB (single-node engine for local quickstart)
sd = sedona.db.connect()
# 2. Create a DataFrame with spatial data
data = [
{"id": 1, "name": "Central Park", "geometry": Point(-73.9665, 40.7812).wkt},  # Point takes (x=longitude, y=latitude)
{"id": 2, "name": "Empire State Building", "geometry": Point(-73.9857, 40.7484).wkt},
{"id": 3, "name": "Times Square", "geometry": Point(-73.9855, 40.7580).wkt},
]
# Convert WKT strings to SedonaDB geometry objects
df = sd.create_dataframe(data).with_column("geometry", sd.st_geomfromwkt(sd.column("geometry")))
print("Original DataFrame:")
df.print_schema()
df.show()
# 3. Perform a spatial SQL query
sd.create_view("nyc_landmarks", df) # Expose DataFrame as a temporary view
# Find landmarks within a certain distance of a reference point
reference_point = Point(-73.98, 40.75).wkt  # A point near Midtown (x=longitude, y=latitude)
result = sd.sql(
f"""SELECT name, ST_Distance(geometry, ST_GeomFromWKT('{reference_point}')) as distance_to_ref
FROM nyc_landmarks
WHERE ST_DWithin(geometry, ST_GeomFromWKT('{reference_point}'), 0.05) -- 0.05 degrees approx. 5.5km
ORDER BY distance_to_ref
"""
)
print("\nLandmarks within 0.05 degrees of the reference point:")
result.show()
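The 0.05-degree radius in the query above is a rough planar threshold, since `ST_DWithin` on lon/lat coordinates measures degrees, not meters. A quick sanity check of what 0.05 degrees spans at Midtown's latitude (a back-of-envelope sketch using the standard ~111.32 km per degree of latitude; the helper name is ours, not a Sedona API):

```python
import math

# One degree of latitude is ~111.32 km everywhere; one degree of longitude
# shrinks with latitude by a factor of cos(latitude).
KM_PER_DEG_LAT = 111.32

def degree_radius_km(deg, latitude_deg):
    """Return (north-south km, east-west km) spanned by `deg` degrees."""
    ns = deg * KM_PER_DEG_LAT
    ew = deg * KM_PER_DEG_LAT * math.cos(math.radians(latitude_deg))
    return ns, ew

ns, ew = degree_radius_km(0.05, 40.75)  # latitude of the reference point
print(f"0.05 deg ~ {ns:.2f} km north-south, {ew:.2f} km east-west")
```

For distance filters in real units, project the data or use a geography-aware predicate (e.g., `ST_DistanceSphere`) rather than treating degrees as meters.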