{"library":"pyspark-hnsw","title":"PySpark HNSW Library","description":"pyspark-hnsw is a Python library that provides a distributed implementation of Hierarchical Navigable Small World (HNSW) graphs for Approximate Nearest Neighbor (ANN) search on Apache Spark. It enables efficient vector similarity search on large datasets within a PySpark environment, leveraging Spark's distributed processing capabilities. The current stable version on PyPI is 1.1.0; releases follow a moderate cadence, with minor updates in recent months.","language":"python","status":"active","last_verified":"Sat May 16","install":{"commands":["pip install pyspark-hnsw"],"cli":null},"imports":["from pyspark_hnsw import HnswIndex"],"auth":{"required":false,"env_vars":[]},"quickstart":{"code":"import os\nimport shutil\n\nimport numpy as np\nfrom pyspark.sql import SparkSession\nfrom pyspark_hnsw import HnswIndex\n\n# Configure Spark (local mode for example)\nspark = (\n    SparkSession.builder\n    .appName(\"HnswQuickstart\")\n    .master(\"local[*]\")\n    .getOrCreate()\n)\n\n# Create some sample data with 128-dimensional vectors\ndata = [(i, [float(x) for x in np.random.rand(128)]) for i in range(1000)]\ndf = spark.createDataFrame(data, [\"id\", \"vector\"])\n\n# Define a path for the index (local or distributed filesystem like HDFS/S3)\n# Ensure this path is writable and accessible by Spark workers\nindex_path = \"hnsw_index_test_dir\"\n\n# Clean up previous index if it exists for repeatable runs\n# (os/shutil only handle local paths; HDFS/S3 paths need their own cleanup)\nif os.path.exists(index_path):\n    shutil.rmtree(index_path)\n\n# Build the HNSW index\nhnsw_index = HnswIndex(spark, \"id\", \"vector\", index_path) \\\n    .setM(16) \\\n    .setEf(100) \\\n    .setNumPartitions(10) \\\n    .setDistanceType(\"cosine\") \\\n    .build(df)\n\n# Define a query vector\nquery_vector = [float(x) for x in np.random.rand(128)]\nnum_neighbors = 5\n\n# Find nearest neighbors\nresult = 
hnsw_index.findNearestNeighbors(query_vector, num_neighbors)\nresult.show()\n\n# Stop the Spark session\nspark.stop()\n","lang":"python","description":"This quickstart demonstrates how to initialize a Spark session, create a sample DataFrame with vector data, build an HNSW index using `HnswIndex`, and then perform a nearest neighbor search. Remember to configure Spark appropriately for your environment (e.g., local, YARN, Kubernetes) and ensure the `index_path` is accessible by all Spark workers.","tag":null,"tag_description":null,"last_tested":null,"results":[]},"compatibility":{"tag":null,"tag_description":null,"last_tested":"2026-05-16","installed_version":"1.1.0","pypi_latest":"1.1.0","is_stale":false,"summary":{"python_range":"3.9–3.13","success_rate":100,"avg_install_s":1.6,"avg_import_s":null,"wheel_type":"wheel"},"results":[{"runtime":"python:3.10-alpine","python_version":"3.10","os_libc":"alpine (musl)","variant":"pyspark-hnsw","exit_code":0,"wheel_type":"wheel","failure_reason":null,"import_side_effects":"broken","install_time_s":null,"import_time_s":null,"mem_mb":null,"disk_size":"18.0M"},{"runtime":"python:3.10-slim","python_version":"3.10","os_libc":"slim (glibc)","variant":"pyspark-hnsw","exit_code":0,"wheel_type":"wheel","failure_reason":null,"import_side_effects":"broken","install_time_s":1.6,"import_time_s":null,"mem_mb":null,"disk_size":"18M"},{"runtime":"python:3.11-alpine","python_version":"3.11","os_libc":"alpine (musl)","variant":"pyspark-hnsw","exit_code":0,"wheel_type":"wheel","failure_reason":null,"import_side_effects":"broken","install_time_s":null,"import_time_s":null,"mem_mb":null,"disk_size":"19.8M"},{"runtime":"python:3.11-slim","python_version":"3.11","os_libc":"slim (glibc)","variant":"pyspark-hnsw","exit_code":0,"wheel_type":"wheel","failure_reason":null,"import_side_effects":"broken","install_time_s":1.7,"import_time_s":null,"mem_mb":null,"disk_size":"20M"},{"runtime":"python:3.12-alpine","python_version":"3.12","os_libc":"alpine 
(musl)","variant":"pyspark-hnsw","exit_code":0,"wheel_type":"wheel","failure_reason":null,"import_side_effects":"broken","install_time_s":null,"import_time_s":null,"mem_mb":null,"disk_size":"11.7M"},{"runtime":"python:3.12-slim","python_version":"3.12","os_libc":"slim (glibc)","variant":"pyspark-hnsw","exit_code":0,"wheel_type":"wheel","failure_reason":null,"import_side_effects":"broken","install_time_s":1.5,"import_time_s":null,"mem_mb":null,"disk_size":"12M"},{"runtime":"python:3.13-alpine","python_version":"3.13","os_libc":"alpine (musl)","variant":"pyspark-hnsw","exit_code":0,"wheel_type":"wheel","failure_reason":null,"import_side_effects":"broken","install_time_s":null,"import_time_s":null,"mem_mb":null,"disk_size":"11.5M"},{"runtime":"python:3.13-slim","python_version":"3.13","os_libc":"slim (glibc)","variant":"pyspark-hnsw","exit_code":0,"wheel_type":"wheel","failure_reason":null,"import_side_effects":"broken","install_time_s":1.5,"import_time_s":null,"mem_mb":null,"disk_size":"12M"},{"runtime":"python:3.9-alpine","python_version":"3.9","os_libc":"alpine (musl)","variant":"pyspark-hnsw","exit_code":0,"wheel_type":"wheel","failure_reason":null,"import_side_effects":"broken","install_time_s":null,"import_time_s":null,"mem_mb":null,"disk_size":"17.5M"},{"runtime":"python:3.9-slim","python_version":"3.9","os_libc":"slim (glibc)","variant":"pyspark-hnsw","exit_code":0,"wheel_type":"wheel","failure_reason":null,"import_side_effects":"broken","install_time_s":1.8,"import_time_s":null,"mem_mb":null,"disk_size":"18M"}]}}