hdbscan: Hierarchical DBSCAN (HDBSCAN)

Description

Fast C++ implementation of the HDBSCAN (Hierarchical DBSCAN) and its related algorithms.

Usage

hdbscan(
  x,
  minPts,
  gen_hdbscan_tree = FALSE,
  gen_simplified_tree = FALSE,
  verbose = FALSE
)
# S3 method for hdbscan
print(x, ...)
# S3 method for hdbscan
plot(
  x,
  scale = "suggest",
  gradient = c("yellow", "red"),
  show_flat = FALSE,
  ...
)
coredist(x, minPts)
mrdist(x, minPts, coredist = NULL)
# S3 method for hdbscan
predict(object, newdata, data, ...)

Value

hdbscan() returns object of class hdbscan with the following components:

cluster: A integer vector with cluster assignments. Zero indicates noise points.
minPts: value of the minPts parameter.
cluster_scores: The sum of the stability scores for each salient (flat) cluster. Corresponds to cluster IDs given the in "cluster" element.
membership_prob: The probability or individual stability of a point within its clusters. Between 0 and 1.
outlier_scores: The GLOSH outlier score of each point.
hc: An hclust object of the HDBSCAN hierarchy.

coredist() returns a vector with the core distance for each data point.

mrdist() returns a dist object containing pairwise mutual reachability distances.

Arguments

x: a data matrix (Euclidean distances are used) or a dist object calculated with an arbitrary distance metric.
minPts: integer; Minimum size of clusters. See details.
gen_hdbscan_tree: logical; should the robust single linkage tree be explicitly computed (see cluster tree in Chaudhuri et al, 2010).
gen_simplified_tree: logical; should the simplified hierarchy be explicitly computed (see Campello et al, 2013).
verbose: report progress.
...: additional arguments are passed on.
scale: integer; used to scale condensed tree based on the graphics device. Lower scale results in wider trees.
gradient: character vector; the colors to build the condensed tree coloring with.
show_flat: logical; whether to draw boxes indicating the most stable clusters.
coredist: numeric vector with precomputed core distances (optional).
object: clustering object.
newdata: new data points for which the cluster membership should be predicted.
data: the data set used to create the clustering object.

Author

Matt Piekenbrock

Details

This fast implementation of HDBSCAN (Campello et al., 2013) computes the hierarchical cluster tree representing density estimates along with the stability-based flat cluster extraction. HDBSCAN essentially computes the hierarchy of all DBSCAN* clusterings, and then uses a stability-based extraction method to find optimal cuts in the hierarchy, thus producing a flat solution.

HDBSCAN performs the following steps:

Compute mutual reachability distance mrd between points (based on distances and core distances).
Use mdr as a distance measure to construct a minimum spanning tree.
Prune the tree using stability.
Extract the clusters.

Additional, related algorithms including the "Global-Local Outlier Score from Hierarchies" (GLOSH; see section 6 of Campello et al., 2015) is available in function glosh() and the ability to cluster based on instance-level constraints (see section 5.3 of Campello et al. 2015) are supported. The algorithms only need the parameter minPts.

Note that minPts not only acts as a minimum cluster size to detect, but also as a "smoothing" factor of the density estimates implicitly computed from HDBSCAN.

coredist(): The core distance is defined for each point as the distance to the MinPts's neighbor. It is a density estimate.

mrdist(): The mutual reachability distance is defined between two points as mrd(a, b) = max(coredist(a), coredist(b), dist(a, b)). This distance metric is used by HDBSCAN. It has the effect of increasing distances in low density areas.

predict() assigns each new data point to the same cluster as the nearest point if it is not more than that points core distance away. Otherwise the new point is classified as a noise point (i.e., cluster ID 0).

References

Campello RJGB, Moulavi D, Sander J (2013). Density-Based Clustering Based on Hierarchical Density Estimates. Proceedings of the 17th Pacific-Asia Conference on Knowledge Discovery in Databases, PAKDD 2013, Lecture Notes in Computer Science 7819, p. 160. tools:::Rd_expr_doi("10.1007/978-3-642-37456-2_14")

Campello RJGB, Moulavi D, Zimek A, Sander J (2015). Hierarchical density estimates for data clustering, visualization, and outlier detection. ACM Transactions on Knowledge Discovery from Data (TKDD), 10(5):1-51. tools:::Rd_expr_doi("10.1145/2733381")

Examples

Run this code

## cluster the moons data set with HDBSCAN
data(moons)

res <- hdbscan(moons, minPts = 5)
res

plot(res)
plot(moons, col = res$cluster + 1L)

## cluster the moons data set with HDBSCAN using Manhattan distances
res <- hdbscan(dist(moons, method = "manhattan"), minPts = 5)
plot(res)
plot(moons, col = res$cluster + 1L)

## DS3 from Chameleon
data("DS3")

res <- hdbscan(DS3, minPts = 50)
res

## Plot the simplified tree, highlight the most stable clusters
plot(res, show_flat = TRUE)

## Plot the actual clusters (noise has cluster id 0 and is shown in black)
plot(DS3, col = res$cluster + 1L, cex = .5)

Run the code above in your browser using DataLab