UMAP (Uniform Manifold Approximation and Projection)
UMAP (Uniform Manifold Approximation and Projection) is a general-purpose manifold learning and dimensionality reduction algorithm. It builds a weighted neighbor graph of the data in the original high-dimensional space, then searches for a low-dimensional projection whose fuzzy topological structure is as close as possible to that of the graph. The current version is 0.5.12, with a release cadence that includes frequent patch releases and minor updates.
Warnings
- gotcha UMAP is stochastic, and results are not reproducible without setting `random_state`. This applies to both `UMAP` initialization and any subsequent operations like `transform`.
- gotcha The `n_neighbors` and `min_dist` parameters heavily influence the resulting manifold structure. Choosing appropriate values is critical for meaningful results, and defaults may not always be optimal for specific datasets.
- gotcha The `transform` method for new, out-of-sample data points performs an *approximate* projection. It is not guaranteed to perfectly preserve the relationships from the training data or match the quality of the `fit_transform` method.
- gotcha UMAP's performance relies heavily on `numba` for just-in-time compilation. Issues with `numba` installation or environment configuration (e.g., older compilers) can lead to significant performance degradation or errors.
- gotcha UMAP is not inherently scale-invariant. Features with larger numeric ranges dominate the distance calculations and therefore the resulting manifold structure, so features measured on different scales are usually standardized before fitting.
Install
pip install umap-learn
Imports
- UMAP
import umap
reducer = umap.UMAP()
- UMAP
from umap import UMAP
Quickstart
import umap
from sklearn.datasets import make_blobs
# 1. Generate some sample data
X, y = make_blobs(n_samples=500, centers=4, cluster_std=1.0, random_state=42)
# 2. Initialize UMAP reducer
# n_neighbors: Balances local vs. global structure. Larger values preserve more global structure.
# min_dist: Controls how tightly points are packed together. Smaller values lead to denser clusters.
# n_components: Desired dimensionality of the output embedding.
# random_state: For reproducible results.
reducer = umap.UMAP(n_neighbors=15, min_dist=0.1, n_components=2, random_state=42)
# 3. Fit and transform the data
embedding = reducer.fit_transform(X)
# The 'embedding' now contains the 2D projection of the original data
print(f"Original data shape: {X.shape}")
print(f"UMAP embedding shape: {embedding.shape}")
# print(embedding[:5]) # Display first 5 embedded points