HDBSCAN Clustering

0.8.42 · active · verified Sat Apr 11

hdbscan is a Python implementation of HDBSCAN, a clustering algorithm developed by Campello, Moulavi, and Sander. The algorithm extends DBSCAN by converting it into a hierarchical clustering algorithm, then extracting a flat partitioning from the hierarchy based on cluster stability. It handles clusters of varying density and labels points that belong to no cluster as noise. The current version is 0.8.42; releases are frequent and mostly minor, fixing bugs and adding small features.

Warnings

Install

Imports

Quickstart

This quickstart clusters synthetic data with `hdbscan.HDBSCAN`. It initializes the model with the key parameters `min_cluster_size` and `min_samples`, fits it to the data, and reads the cluster labels from `labels_`. A label of -1 marks a noise point.

import numpy as np
from hdbscan import HDBSCAN
from sklearn.datasets import make_blobs

# Generate sample data
data, _ = make_blobs(n_samples=500, centers=4, cluster_std=0.6, random_state=0)

# Initialize HDBSCAN model
# min_cluster_size: the smallest group of points that counts as a cluster
# min_samples: how conservative the clustering is; higher values
#   declare more points as noise
# prediction_data=True: cache the extra data needed to predict on new points later
clusterer = HDBSCAN(min_cluster_size=15, min_samples=5, prediction_data=True)

# Fit the model; the cluster labels are stored in clusterer.labels_
clusterer.fit(data)

# Count the distinct labels, excluding the noise label -1
n_clusters = len(set(clusterer.labels_)) - (1 if -1 in clusterer.labels_ else 0)
print(f"Number of clusters found: {n_clusters}")
print(f"First 10 labels: {clusterer.labels_[:10]}")

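# probabilities_ holds each point's strength of membership in its assigned
# cluster (0.0 for noise points). For full soft clustering across all clusters,
# see hdbscan.all_points_membership_vectors (requires prediction_data=True).
print(f"First 10 membership strengths: {clusterer.probabilities_[:10]}")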
