{"id":5703,"library":"pyspark-hnsw","title":"PySpark HNSW Library","description":"pyspark-hnsw is a Python library that provides a distributed implementation of Hierarchical Navigable Small Worlds (HNSW) for Approximate Nearest Neighbor (ANN) search on Apache Spark. It wraps the JVM hnswlib implementation behind Spark ML-style estimators, enabling efficient vector similarity search on large datasets from a PySpark environment. The current stable version on PyPI is 1.1.0; the GitHub repository is slightly ahead (see warnings).","status":"active","version":"1.1.0","language":"en","source_language":"en","source_url":"https://github.com/jelmerk/hnswlib/tree/master/hnswlib-pyspark","tags":["pyspark","ml","ann","vector-search","hnsw","similarity-search","distributed-computing"],"install":[{"cmd":"pip install pyspark-hnsw","lang":"bash","label":"Install with pip"}],"dependencies":[{"reason":"Required for distributed processing and Spark integration.","package":"pyspark","optional":false},{"reason":"The JVM implementation that does the actual indexing and search. It is not installed by pip; provide it to Spark via `spark.jars.packages` or `--packages`, using the artifact that matches your Spark and Scala versions.","package":"com.github.jelmerk:hnswlib-spark (JVM artifact)","optional":false}],"imports":[{"note":"The top-level package is `pyspark_hnsw` (not `hnswlib_pyspark`), and the estimator lives in its `knn` sub-module.","wrong":"from hnswlib_pyspark import HnswSimilarity","symbol":"HnswSimilarity","correct":"from pyspark_hnsw.knn import HnswSimilarity"}],"quickstart":{"code":"from pyspark.sql import SparkSession\nfrom pyspark.ml.linalg import Vectors\nfrom pyspark_hnsw.knn import HnswSimilarity\nimport random\n\n# The JVM artifact must match your Spark and Scala versions; the coordinates\n# below are an example -- check the project README for the right ones.\nspark = (SparkSession.builder\n    .appName('HnswQuickstart')\n    .master('local[*]')\n    .config('spark.jars.packages',\n            'com.github.jelmerk:hnswlib-spark_2.4_2.12:1.1.0')\n    .getOrCreate())\n\n# Sample data: ids paired with 128-dimensional dense vectors\ndata = [(i, Vectors.dense([random.random() for _ in range(128)]))\n        for i in range(1000)]\ndf = spark.createDataFrame(data, ['id', 'features'])\n\n# Configure and fit the HNSW estimator\nhnsw = HnswSimilarity(identifierCol='id', queryIdentifierCol='id',\n                      featuresCol='features', distanceFunction='cosine',\n                      m=16, ef=100, efConstruction=200, k=5,\n                      numPartitions=2, excludeSelf=True)\nmodel = hnsw.fit(df)\n\n# transform() returns, for each query row, its k approximate nearest neighbours\nmodel.transform(df).show(truncate=False)\n\nspark.stop()\n","lang":"python","description":"This quickstart starts a Spark session (note `spark.jars.packages`, which pulls in the JVM implementation), builds a DataFrame of dense vectors, fits an `HnswSimilarity` estimator, and queries the fitted model with `transform`, which returns the k approximate nearest neighbours for each query row. Configure Spark for your environment (e.g., local, YARN, Kubernetes) and make sure the artifact coordinates match your Spark and Scala versions."},"warnings":[{"fix":"Check both PyPI and GitHub for the most up-to-date information. If installing from source, pin to a specific commit or tag.","message":"There is a discrepancy between the latest PyPI version (1.1.0) and the latest GitHub release (1.2.1). Make sure you know which version you are installing and which features and fixes it includes.","severity":"gotcha","affected_versions":"1.1.0 (PyPI) vs 1.2.1 (GitHub)"},{"fix":"Monitor the Spark UI for memory and CPU usage. Start with smaller datasets and conservative HNSW parameters, then scale up. Increase Spark memory allocations if OutOfMemory errors occur.","message":"Building and querying HNSW indices, especially with high dimensionality or large datasets, can be memory and CPU intensive. 
Adjust Spark executor memory (`spark.executor.memory`), the partition count (`numPartitions`), and the HNSW parameters (`m`, `ef`, `efConstruction`) accordingly.","severity":"gotcha","affected_versions":"All versions"},{"fix":"If upgrading to 1.2.x or later from source, test thoroughly, especially if you have custom Spark configurations or Java dependencies. Verify that no existing Spark job configuration relies on old internal class names.","message":"Version 1.2.0 (not yet on PyPI as of 1.1.0) repackages classes to avoid JPMS issues. While this primarily affects Java Module System users, it may change internal class paths or JAR-level dependencies, which can indirectly affect complex PySpark setups or users who reference specific internal classes.","severity":"breaking","affected_versions":">=1.2.0 (GitHub releases)"},{"fix":"On a Spark cluster, always persist to a distributed file system path (e.g., `s3a://your-bucket/index_name`, `hdfs://namenode/user/index_name`). For local testing, ensure the path exists and has write permissions.","message":"Any path used to persist the built index must be accessible and writable by all Spark executors, so on a cluster it should point to a distributed file system such as HDFS or S3. A driver-local path leaves the index on a single machine, which is not suitable for distributed use.","severity":"gotcha","affected_versions":"All versions"},{"fix":"Verify the schema of your input DataFrame, especially the features column. Use `pyspark.ml.linalg.Vectors.dense(...)` or a `VectorAssembler` to produce Spark ML vectors before fitting.","message":"Ensure your vectors are in a format the library accepts, typically a Spark ML vector (`pyspark.ml.linalg.Vector`) or an array of floats. Mismatched column types lead to errors during index building or querying.","severity":"gotcha","affected_versions":"All versions"}],"env_vars":null,"last_verified":"2026-04-11T00:00:00.000Z","next_check":"2026-07-10T00:00:00.000Z"}