{"id":3509,"library":"hdbscan","title":"HDBSCAN Clustering","description":"hdbscan is a Python implementation of HDBSCAN, a clustering algorithm developed by Campello, Moulavi, and Sander that extends DBSCAN by converting it into a hierarchical clustering algorithm and then extracting a flat clustering based on the stability of clusters. It handles clusters of varying density and labels low-density points as noise. The current version is 0.8.42, with frequent minor releases addressing bugs and adding small features.","status":"active","version":"0.8.42","language":"en","source_language":"en","source_url":"https://github.com/scikit-learn-contrib/hdbscan","tags":["clustering","machine-learning","unsupervised-learning","density-based","scikit-learn-compatible"],"install":[{"cmd":"pip install hdbscan","lang":"bash","label":"Install stable version"}],"dependencies":[{"reason":"Core numerical operations and data structures.","package":"numpy"},{"reason":"Scientific computing functionality, particularly for sparse matrices and spatial structures.","package":"scipy"},{"reason":"Provides base classes, utilities, and potentially some distance metrics. 
hdbscan aims to be scikit-learn compatible.","package":"scikit-learn"}],"imports":[{"note":"`HDBSCAN_` is an internal alias for the Cython implementation; the public-facing and stable API is `HDBSCAN`.","wrong":"import hdbscan; model = hdbscan.HDBSCAN_()","symbol":"HDBSCAN","correct":"from hdbscan import HDBSCAN"}],"quickstart":{"code":"import numpy as np\nfrom hdbscan import HDBSCAN\nfrom sklearn.datasets import make_blobs\n\n# Generate sample data: 500 points around 4 centers\ndata, _ = make_blobs(n_samples=500, centers=4, cluster_std=0.6, random_state=0)\n\n# Initialize the HDBSCAN model\n# min_cluster_size is the smallest group of points treated as a cluster\n# min_samples controls how conservative the clustering is; higher values mean more points are declared noise\n# prediction_data=True caches the data needed to assign new points later (e.g. with hdbscan.approximate_predict)\nclusterer = HDBSCAN(min_cluster_size=15, min_samples=5, prediction_data=True)\n\n# Fit the model; cluster labels are stored in clusterer.labels_ (-1 marks noise)\nclusterer.fit(data)\n\nprint(f\"Number of clusters found: {len(np.unique(clusterer.labels_)) - (1 if -1 in clusterer.labels_ else 0)}\")\nprint(f\"First 10 labels: {clusterer.labels_[:10]}\")\n\n# probabilities_ holds each point's membership strength in its assigned cluster\n# print(f\"First 10 membership probabilities: {clusterer.probabilities_[:10]}\")\n","lang":"python","description":"This quickstart demonstrates how to use `hdbscan.HDBSCAN` to cluster sample data. It initializes the model with the key parameters `min_cluster_size` and `min_samples`, then fits the data and retrieves cluster labels. Labels of -1 indicate noise points."},"warnings":[{"fix":"Upgrade hdbscan to version 0.8.38 or later using `pip install --upgrade hdbscan`.","message":"hdbscan versions prior to 0.8.38 (specifically 0.8.37) had known incompatibilities with NumPy 2.x, leading to build failures. 
Upgrading to NumPy 2.x requires hdbscan 0.8.38 or newer.","severity":"breaking","affected_versions":"<0.8.38"},{"fix":"Upgrade to Python 3.8 or a newer supported version (e.g., Python 3.10, 3.11, 3.12).","message":"Python 3.7 support was dropped in version 0.8.38.post2. Users on Python 3.7 will not receive updates or fixes for hdbscan.","severity":"deprecated","affected_versions":">=0.8.38.post2"},{"fix":"Experiment with different values, potentially using grid search or visual inspection of clusterings (e.g., using `clusterer.condensed_tree_.plot()` for insight), to find parameters that suit your specific dataset and problem.","message":"The `min_cluster_size` and `min_samples` parameters are critical and highly sensitive to the dataset. Poorly chosen values can lead to over-clustering, under-clustering, or too many points labeled as noise. `min_cluster_size` defines the smallest group to be considered a cluster, while `min_samples` (similar to DBSCAN's `minPts`) affects the density threshold for core points. `cluster_selection_epsilon` can also significantly impact results, particularly for merging clusters at different densities.","severity":"gotcha","affected_versions":"All"},{"fix":"Be aware that results from versions 0.8.38 and later may not be directly comparable to those from earlier versions due to this algorithmic enhancement. Re-evaluate models if upgrading.","message":"The internal 'branch detection' algorithm, which significantly improves handling of long, flaring clusters, was introduced in version 0.8.38. This change fundamentally alters how the algorithm processes the hierarchy and may produce different clustering results compared to previous versions, even with identical parameters.","severity":"gotcha","affected_versions":"<0.8.38"},{"fix":"Upgrade to hdbscan 0.8.42 or later if you depend on accurate outlier scores.","message":"The calculation of outlier scores (the `outlier_scores_` attribute of a fitted `HDBSCAN` model) was fixed in version 0.8.42. 
Users relying on these scores in earlier versions might have received incorrect or unreliable values.","severity":"gotcha","affected_versions":"<0.8.42"}],"env_vars":null,"last_verified":"2026-04-11T00:00:00.000Z","next_check":"2026-07-10T00:00:00.000Z"}