kmodes Clustering Library
Python implementations of the k-modes and k-prototypes clustering algorithms. k-modes clusters purely categorical data; k-prototypes handles data that mixes categorical and numerical features. The current version is 0.12.2, and the project is actively developed, with several releases per year.
Warnings
- breaking Missing values (`np.nan`) in the input matrix (`X`) are no longer supported as of version 0.11.1, in line with scikit-learn's approach. Handle missing data yourself, by imputation or removal, before fitting.
- breaking Python 3.4 support was dropped in version 0.10.2. Official support for Python 3.10 was added in 0.12.0. Ensure your Python environment is compatible (Python 3.6+ is generally safe).
- breaking The minimum `scikit-learn` version was upgraded to 0.22 in kmodes version 0.11.0. Older `scikit-learn` versions may cause compatibility issues or `AttributeError`.
- gotcha If a numerical feature column passed to `KPrototypes` contains string values, fitting fails with `TypeError: '<' not supported between instances of 'str' and 'float'`. Convert numerical columns to a numeric dtype first.
- gotcha `KPrototypes` does not infer which columns are categorical. Pass their column indices explicitly through the `categorical` argument of `fit` or `fit_predict`. If it is omitted, all columns are assumed to be numerical, which can raise an error when mixed data types are present.
- gotcha A `ModuleNotFoundError` (e.g., `No module named 'kmodes.kmodes'`) can occur if your working Python file is named `kmodes.py`, as it might shadow the installed `kmodes` package.
- gotcha Encountering `ValueError: Clustering algorithm could not initialize` is often an indication that the data and chosen parameters (e.g., `n_clusters`, `init` method) are not suitable. It's not necessarily a bug.
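The missing-value, string-in-numerical-column, and `categorical` warnings above can all be handled in one preprocessing step. The following is a minimal sketch, assuming the data starts out in a pandas DataFrame; the helper name `prepare_mixed_frame` and the column names are hypothetical, and dropping incomplete rows is only one option (imputation is the other).

```python
import numpy as np
import pandas as pd


def prepare_mixed_frame(df, numerical_cols):
    """Hypothetical helper: return (X, categorical_indices) for KPrototypes."""
    df = df.copy()
    for col in numerical_cols:
        # Strings that are not valid numbers become NaN here instead of
        # triggering a TypeError later during fitting.
        df[col] = pd.to_numeric(df[col], errors="coerce")
    # Missing values are not accepted, so drop incomplete rows
    # (imputing them would be the alternative).
    df = df.dropna()
    # Every column not declared numerical is treated as categorical.
    categorical_idx = [
        i for i, col in enumerate(df.columns) if col not in numerical_cols
    ]
    return df.to_numpy(dtype=object), categorical_idx


raw = pd.DataFrame({
    "age": ["34", "51", "n/a", "27", "45", "39"],
    "income": [52000.0, np.nan, 61000.0, 48000.0, 75000.0, 58000.0],
    "city": ["Oslo", "Lima", "Oslo", None, "Lima", "Oslo"],
    "plan": ["basic", "pro", "pro", "basic", "pro", "basic"],
})

X, categorical_idx = prepare_mixed_frame(raw, numerical_cols=["age", "income"])
print(X.shape)          # rows with any missing or unparseable value are gone
print(categorical_idx)  # positions of "city" and "plan"
```

The resulting `X` and `categorical_idx` can then be passed on as `KPrototypes(...).fit_predict(X, categorical=categorical_idx)`.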
Install
pip install kmodes
Imports
- KModes
from kmodes.kmodes import KModes
- KPrototypes
from kmodes.kprototypes import KPrototypes
Quickstart
import numpy as np
from kmodes.kmodes import KModes
# Generate random categorical data (e.g., 100 samples, 10 features, 20 unique categories per feature)
data = np.random.choice(20, (100, 10))
# Initialize KModes with 4 clusters, Huang initialization, 5 initialization runs
km = KModes(n_clusters=4, init='Huang', n_init=5, verbose=1)
# Fit the model and predict clusters
clusters = km.fit_predict(data)
# Print the cluster centroids
print("Cluster Centroids:\n", km.cluster_centroids_)
print("Assigned Clusters:\n", clusters[:10]) # Display first 10 assigned clusters
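For mixed data, a comparable sketch with `KPrototypes` might look like the following. The toy data, the helper names `make_mixed_data` and `cluster_mixed`, and the parameter values are illustrative assumptions, not part of the library; the key point is that the categorical column indices are passed to `fit_predict`.

```python
import numpy as np


def make_mixed_data(n_samples=60, seed=0):
    """Hypothetical toy data: two numerical columns, then two categorical ones."""
    rng = np.random.default_rng(seed)
    X = np.empty((n_samples, 4), dtype=object)
    X[:, 0] = rng.normal(40.0, 12.0, n_samples)                 # numerical
    X[:, 1] = rng.normal(55000.0, 9000.0, n_samples)            # numerical
    X[:, 2] = rng.choice(["Oslo", "Lima", "Pune"], n_samples)   # categorical
    X[:, 3] = rng.choice(["basic", "pro"], n_samples)           # categorical
    return X, [2, 3]


def cluster_mixed(X, categorical, n_clusters=3, seed=0):
    """Fit KPrototypes and return the fitted model with its cluster labels."""
    # Imported here so that make_mixed_data stays usable without kmodes.
    from kmodes.kprototypes import KPrototypes

    model = KPrototypes(
        n_clusters=n_clusters, init="Huang", n_init=3, random_state=seed
    )
    # The categorical column indices must be given explicitly.
    labels = model.fit_predict(X, categorical=categorical)
    return model, labels
```

Calling `model, labels = cluster_mixed(*make_mixed_data())` fits the model; `model.cluster_centroids_` then holds the fitted centroids and `labels` the cluster assignment of each row.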