{"id":7282,"library":"h3-pyspark","title":"PySpark Bindings for H3","description":"h3-pyspark provides PySpark bindings for Uber's H3 hierarchical hexagonal geospatial indexing system. It enables efficient geospatial operations and analysis directly within Spark data pipelines by exposing H3 functions as Spark UDFs. The library is currently at version 1.2.6 and is actively developed and maintained, with recent releases addressing bug fixes and edge cases.","status":"active","version":"1.2.6","language":"en","source_language":"en","source_url":"https://github.com/kevinschaich/h3-pyspark","tags":["pyspark","h3","geospatial","indexing","spark","gis"],"install":[{"cmd":"pip install h3-pyspark","lang":"bash","label":"PyPI"},{"cmd":"conda install -c conda-forge h3-pyspark","lang":"bash","label":"Conda"}],"dependencies":[{"reason":"Provides the Spark DataFrame API and execution environment for the H3 operations. This library is a binding to PySpark.","package":"pyspark","optional":false},{"reason":"The core Python binding for the H3 geospatial indexing system, which h3-pyspark wraps and extends.","package":"h3","optional":false}],"imports":[{"note":"The primary way to import the library and access its functions.","symbol":"h3_pyspark","correct":"import h3_pyspark"},{"note":"Functions are accessed directly via the imported `h3_pyspark` module, not from a `functions` submodule.","wrong":"from h3_pyspark.functions import geo_to_h3","symbol":"geo_to_h3","correct":"df = df.withColumn('h3_9', h3_pyspark.geo_to_h3('lat', 'lng', 'resolution'))"}],"quickstart":{"code":"from pyspark.sql import SparkSession, functions as F\nimport h3_pyspark\nimport os\n\n# Initialize Spark Session (adjust master for your environment, e.g., 'local[*]'):\nspark = SparkSession.builder.master(os.environ.get('SPARK_MASTER', 'local[*]')).appName(\"H3PySparkQuickstart\").getOrCreate()\n\n# Create a DataFrame with latitude, longitude, and desired H3 
resolution\ndata = [{'lat': 37.769377, 'lng': -122.388903, 'resolution': 9}]\ndf = spark.createDataFrame(data)\n\n# Convert geographic coordinates to an H3 index\ndf_with_h3 = df.withColumn('h3_index', h3_pyspark.geo_to_h3(F.col('lat'), F.col('lng'), F.col('resolution')))\n\ndf_with_h3.show()\n\n# Example of an extension function: index_shape for GeoJSON polygons\ngeojson_polygon = \"{\\\"type\\\":\\\"Polygon\\\",\\\"coordinates\\\":[[[-122.4,37.8],[-122.3,37.8],[-122.3,37.7],[-122.4,37.7],[-122.4,37.8]]]}\"\npolygon_df = spark.createDataFrame([{'id': 1, 'geometry': geojson_polygon, 'resolution': 9}])\n\npolygon_h3_df = polygon_df.withColumn(\n    'h3_cells',\n    h3_pyspark.index_shape(F.col('geometry'), F.col('resolution'))\n)\n\npolygon_h3_df.show(truncate=False)\n\nspark.stop()","lang":"python","description":"This quickstart demonstrates how to initialize a SparkSession, create a DataFrame with geospatial coordinates, and use `h3_pyspark.geo_to_h3` to convert latitude and longitude to an H3 index. It also includes an example of `h3_pyspark.index_shape` for indexing GeoJSON polygons. Ensure `pyspark` is configured correctly for your environment."},"warnings":[{"fix":"Refer to the `h3-py` migration guide for changes between H3 v3.x and v4.x. Adapt your code to the new function names and error handling. Verify the version of `h3-py` installed alongside `h3-pyspark` to ensure compatibility.","message":"The underlying `h3-py` library (which `h3-pyspark` wraps) introduced significant breaking changes in its 4.x versions, primarily around function naming conventions (e.g., `k_ring` became `grid_disk`) and error handling.","severity":"breaking","affected_versions":"h3-pyspark 1.x used alongside `h3-py` 4.x, including code originally written against `h3-py` 3.x whose environment now resolves `h3-py` to 4.x."},{"fix":"Upgrade to `h3-pyspark` version 1.2.4 or newer to benefit from improved null value handling. 
Ensure your input data is clean, or explicitly handle nulls (e.g., `df.na.drop()`, `df.fillna()`) before passing it to H3 functions (see the `1.2.4` release notes).","message":"Prior to version 1.2.4, `h3-pyspark` functions might not robustly handle null values in input columns to UDFs, potentially leading to errors or unexpected behavior.","severity":"gotcha","affected_versions":"< 1.2.4"},{"fix":"Upgrade to `h3-pyspark` version 1.2.3 or newer, which includes a fix for this bug and improved error handling for malformed geometries (see the `1.2.3` release notes).","message":"The `index_shape` function in versions prior to 1.2.3 had a known bug where it could miss H3 cells along long line segments, leading to incomplete or inaccurate spatial indexing for complex geometries.","severity":"gotcha","affected_versions":"< 1.2.3"},{"fix":"Ensure that your geometry data is formatted as GeoJSON strings before passing it to functions like `h3_pyspark.index_shape`. Convert from other formats (e.g., WKT) if necessary.","message":"h3-pyspark assumes that geospatial geometries are represented as GeoJSON strings within a Spark DataFrame column, rather than other formats such as WKT.","severity":"gotcha","affected_versions":"All versions"}],"env_vars":null,"last_verified":"2026-04-16T00:00:00.000Z","next_check":"2026-07-15T00:00:00.000Z","problems":[{"fix":"When calling H3 functions (e.g., `geo_to_h3`, `index_shape`), pass Spark Column objects (e.g., `F.col('column_name')`) or column-name strings, not entire DataFrames or other Python objects.","cause":"You are passing an entire DataFrame or a literal of the wrong type to an H3 function that expects a Spark Column expression.","error":"TypeError: Invalid argument, not a string or column: DataFrame[lat: double, lng: double] of type <class 'pyspark.sql.dataframe.DataFrame'>"},{"fix":"Check the full stack trace for more specific details. 
Common causes include invalid H3 resolutions (valid values are 0-15), malformed H3 indices, or corrupted/unexpected data types in the input columns. Ensure input data conforms to H3 requirements and has no unexpected nulls or invalid values.","cause":"This generic PySpark error typically surfaces when a UDF (User Defined Function) from h3-pyspark raises an exception on invalid input or an unhandled condition, which Py4J reports from the JVM side.","error":"Py4JJavaError: An error occurred while calling org.apache.spark.sql.functions.udf"},{"fix":"Verify the exact names and casing of columns in your DataFrame (e.g., with `df.printSchema()`) and ensure they match the column names used in `h3_pyspark` function calls (e.g., `h3_pyspark.geo_to_h3('latitude_column', 'longitude_column', 'res_column')`).","cause":"You are trying to access a column that does not exist in your Spark DataFrame, often due to a typo or incorrect case in the column name.","error":"AnalysisException: 'No such struct field <field_name> in <schema_string>'"}]}