HDBSCAN Clustering
hdbscan is a clustering algorithm developed by Campello, Moulavi, and Zimek that extends DBSCAN by converting it into a hierarchical clustering algorithm, then extracting a flat partitioning from the hierarchy based on cluster stability. It handles clusters of varying density and identifies noise points. The current version is 0.8.42, with frequent minor releases addressing bugs and adding small features.
Warnings
- breaking HDBSCAN versions prior to 0.8.38 (up to and including 0.8.37) had known incompatibilities with NumPy 2.x that caused build failures. Upgrading to NumPy 2.x requires hdbscan 0.8.38 or newer.
- deprecated Python 3.7 support was dropped as of version 0.8.38.post2. Users on Python 3.7 will not receive further updates or fixes for hdbscan.
- gotcha The `min_cluster_size` and `min_samples` parameters are critical and highly sensitive to the dataset. Incorrect values can lead to over-clustering, under-clustering, or too many noise points. `min_cluster_size` defines the smallest group to be considered a cluster, while `min_samples` (similar to DBSCAN's `minPts`) affects the density threshold for core points. `cluster_selection_epsilon` can also significantly impact results, particularly for merging clusters at different densities.
- gotcha The internal 'branch detection' algorithm, which significantly improves handling of long, flaring clusters, was introduced in version 0.8.38. This change fundamentally alters how the algorithm processes the hierarchy and may produce different clustering results compared to previous versions, even with identical parameters.
- gotcha Calculations for outlier scores (`hdbscan.outlier_scores_`) were fixed in version 0.8.42. Users relying on these scores in earlier versions might have received incorrect or unreliable values.
Install
pip install hdbscan
Imports
- HDBSCAN
from hdbscan import HDBSCAN
Quickstart
import numpy as np
from hdbscan import HDBSCAN
from sklearn.datasets import make_blobs
# Generate sample data
data, _ = make_blobs(n_samples=500, centers=4, cluster_std=0.6,
random_state=0)
# Initialize HDBSCAN model
# min_cluster_size is crucial for defining what constitutes a cluster
# min_samples controls how conservative the clustering is; higher values mean more points are declared noise
# prediction_data=True caches the extra data needed to assign new points later with approximate_predict
clusterer = HDBSCAN(min_cluster_size=15, min_samples=5, prediction_data=True)
# Fit the model; cluster labels are stored in labels_ (-1 marks noise)
clusterer.fit(data)
print(f"Number of clusters found: {len(np.unique(clusterer.labels_)) - (1 if -1 in clusterer.labels_ else 0)}")
print(f"First 10 labels: {clusterer.labels_[:10]}")
# You can also get probabilities (soft clusters)
# print(f"First 10 membership probabilities: {clusterer.probabilities_[:10]}")