PySpark Bindings for H3
h3-pyspark provides PySpark bindings for Uber's H3 hierarchical hexagonal geospatial indexing system. It enables efficient geospatial operations and analysis directly within Spark data pipelines by exposing H3 functions as Spark UDFs. The library is currently at version 1.2.6 and is actively developed and maintained, with recent releases addressing bug fixes and edge cases.
Common errors
- `TypeError: Invalid argument, not a string or column: DataFrame[lat: double, lng: double] of type <class 'pyspark.sql.dataframe.DataFrame'>`
  - cause: An entire DataFrame, or a literal of the wrong type, was passed to an H3 function that expects a column name or a Spark Column expression.
  - fix: When calling H3 functions (e.g., `geo_to_h3`, `index_shape`), pass column names or Spark Column objects (e.g., `F.col('column_name')`), not DataFrames or raw Python values where a Column is expected.
- `Py4JJavaError: An error occurred while calling o.apache.spark.sql.functions.udf`
  - cause: This generic PySpark error usually means a UDF from h3-pyspark hit invalid input or an unhandled condition on the JVM side.
  - fix: Read the full stack trace for specifics. Common culprits are invalid H3 resolutions (negative, or above the maximum of 15), malformed H3 indices, and corrupted or unexpected data types in the input columns. Ensure input data conforms to H3 requirements and has no unexpected nulls or invalid values.
- `AnalysisException: 'No such struct field <field_name> in <schema_string>'`
  - cause: The code references a column that does not exist in the DataFrame, often due to a typo or incorrect casing in the column name.
  - fix: Verify the exact column names and casing (e.g., with `df.printSchema()`) and make sure they match the names passed to `h3_pyspark` function calls (e.g., `h3_pyspark.geo_to_h3('latitude_column', 'longitude_column', 'res_column')`).
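Several of these errors can be caught before a job ever launches by validating inputs on the driver side. A minimal pure-Python sketch (the helper names are illustrative, not part of h3-pyspark) that checks an H3 resolution is in the valid 0–15 range and that required columns exist in `df.columns` before any H3 function is called:

```python
def check_h3_resolution(resolution):
    """H3 resolutions are integers in [0, 15]; anything else fails inside the UDF."""
    if not isinstance(resolution, int) or not 0 <= resolution <= 15:
        raise ValueError(f"invalid H3 resolution: {resolution!r} (must be an int in 0..15)")
    return resolution

def check_required_columns(df_columns, required):
    """Compare against df.columns up front to avoid a late AnalysisException."""
    missing = [name for name in required if name not in df_columns]
    if missing:
        raise KeyError(f"missing columns: {missing}; available: {sorted(df_columns)}")

# Usage, with a plain list standing in for df.columns:
check_h3_resolution(9)
check_required_columns(['lat', 'lng', 'resolution'], ['lat', 'lng'])
```

Running these checks before `withColumn` turns a cryptic `Py4JJavaError` or `AnalysisException` into an immediate, readable Python exception.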
Warnings
- breaking The underlying `h3-py` library (which `h3-pyspark` wraps) introduced significant breaking changes in its 4.x releases, primarily renamed functions (e.g., `k_ring` became `grid_disk`, `geo_to_h3` became `latlng_to_cell`) and reworked error handling.
- gotcha Prior to version 1.2.4, `h3-pyspark` functions might not robustly handle null values in input columns to UDFs, potentially leading to errors or unexpected behavior.
- gotcha The `index_shape` function in versions prior to 1.2.3 had a known bug where it might miss H3 cells for long line segments, leading to incomplete or inaccurate spatial indexing for complex geometries.
- gotcha h3-pyspark assumes that geospatial geometries are represented as GeoJSON strings within a Spark DataFrame column, rather than other formats like WKT.
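Since geometries must arrive as GeoJSON strings, it is safer to build them with `json.dumps` than to hand-escape quotes. A small illustrative sketch (the `polygon_geojson` helper is hypothetical, not part of h3-pyspark) that serializes a single-ring polygon and closes the ring, a common GeoJSON mistake:

```python
import json

def polygon_geojson(ring):
    """Serialize a single-ring polygon to a GeoJSON string, closing the ring if needed."""
    if ring[0] != ring[-1]:
        ring = ring + [ring[0]]  # GeoJSON requires the first and last positions to match
    return json.dumps({"type": "Polygon", "coordinates": [ring]})

geojson = polygon_geojson([[-122.4, 37.8], [-122.3, 37.8], [-122.3, 37.7], [-122.4, 37.7]])
parsed = json.loads(geojson)  # round-trip to confirm the string is valid JSON
```

The resulting string can be placed directly in a DataFrame column and passed to `index_shape`.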
Install
- pip install h3-pyspark
- conda install -c conda-forge h3-pyspark
Imports
- h3_pyspark
import h3_pyspark
- geo_to_h3
from h3_pyspark import geo_to_h3
df = df.withColumn('h3_9', geo_to_h3('lat', 'lng', 'resolution'))
Quickstart
from pyspark.sql import SparkSession, functions as F
import h3_pyspark
import os
# Initialize Spark Session (adjust master for your environment, e.g., 'local[*]'):
spark = SparkSession.builder.master(os.environ.get('SPARK_MASTER', 'local[*]')).appName("H3PySparkQuickstart").getOrCreate()
# Create a DataFrame with latitude, longitude, and desired H3 resolution
data = [{"lat": 37.769377, "lng": -122.388903, 'resolution': 9}]
df = spark.createDataFrame(data)
# Convert geographic coordinates to H3 index
df_with_h3 = df.withColumn('h3_index', h3_pyspark.geo_to_h3(F.col('lat'), F.col('lng'), F.col('resolution')))
df_with_h3.show()
# Example of an extension function: index_shape for GeoJSON polygons
from h3_pyspark.indexing import index_shape
geojson_polygon = '{"type": "Polygon", "coordinates": [[[-122.4, 37.8], [-122.3, 37.8], [-122.3, 37.7], [-122.4, 37.7], [-122.4, 37.8]]]}'
polygon_df = spark.createDataFrame([{'id': 1, 'geometry': geojson_polygon, 'resolution': 9}])
polygon_h3_df = polygon_df.withColumn(
    'h3_cells',
    index_shape(F.col('geometry'), F.col('resolution'))
)
polygon_h3_df.show(truncate=False)
spark.stop()