This fast implementation of HDBSCAN (Campello et al., 2013) computes the
hierarchical cluster tree representing density estimates along with the
stability-based flat cluster extraction. HDBSCAN essentially computes the
hierarchy of all DBSCAN* clusterings, and
then uses a stability-based extraction method to find optimal cuts in the
hierarchy, thus producing a flat solution.
HDBSCAN performs the following steps:
Compute mutual reachability distance mrd between points
(based on distances and core distances).
Use mdr as a distance measure to construct a minimum spanning tree.
Prune the tree using stability.
Extract the clusters.
Additional, related algorithms including the "Global-Local Outlier Score
from Hierarchies" (GLOSH; see section 6 of Campello et al., 2015)
is available in function glosh()
and the ability to cluster based on instance-level constraints (see
section 5.3 of Campello et al. 2015) are supported. The algorithms only need
the parameter minPts
.
Note that minPts
not only acts as a minimum cluster size to detect,
but also as a "smoothing" factor of the density estimates implicitly
computed from HDBSCAN.
coredist()
: The core distance is defined for each point as
the distance to the MinPts
's neighbor. It is a density estimate.
mrdist()
: The mutual reachability distance is defined between two points as
mrd(a, b) = max(coredist(a), coredist(b), dist(a, b))
. This distance metric is used by
HDBSCAN. It has the effect of increasing distances in low density areas.
predict()
assigns each new data point to the same cluster as the nearest point
if it is not more than that points core distance away. Otherwise the new point
is classified as a noise point (i.e., cluster ID 0).