hdbscan: hdbscan

Description

Implemenation of the hdbscan algorithm.

Usage

hdbscan(edges, neighbors = NULL, minPts = 20, K = 5, threads = NULL,
  verbose = getOption("verbose", TRUE))

Arguments

edges

An edge matrix of the type returned by buildEdgeMatrix or, alternatively, a largeVis object.

neighbors

An adjacency matrix of the type returned by randomProjectionTreeSearch. Must be specified unless edges is a largeVis object.

minPts

The minimum number of points in a cluster.

The number of points in the core neighborhood. (See details.)

threads

Maximum number of threads. Determined automatically if NULL (the default). It is unlikely that this parameter should ever need to be adjusted. It is only available to make it possible to abide by the CRAN limitation that no package use more than two cores.

verbose

Verbosity.

Value

An object of type hdbscan with the following fields:

'clusters': A vector of the cluster membership for each vertex. Outliers are given NA
'probabilities': A vector of the degree of each vertex' membership. This is calculated by standardizing each vertex' \(lambda_p\) against the \(lambda_birth\) and \(lambda_death\) of the cluster.
'glosh': A vector of GLOSH outlier scores for each node assigned to a cluster. NA for nodes not in a cluster.
'tree': The minimum spanning tree used to generate the clustering.
'hierarchy': A representation of the condensed cluster hierarchy.
'call': The call.

The hierarchy describes the complete post-condensation structure of the tree:

'nodemembership': The cluster ID of the vertex's immediate parent, after condensation.
'lambda': \(\lambda_p\) for each vertex.
'parent': The cluster ID of each cluster's parent.
'stability': The cluster's stability, taking into account child-node stabilities.
'selected': Whether the cluster was selected.
'coredistances': The core distance determined for each vertex.
'lamba_birth': \(\lambda_b\) for each cluster.
'lambda_deaeth': \(\lambda_d\) for each cluster.

Details

The hyperparameter K controls the size of core neighborhoods. The algorithm measures the density around a point as 1 / the distance between that point and its \(k\)th nearest neighbor. A low value of K is similar to clustering nearest neighbors rather than based on density. A high value of K may cause the algorithm to miss some (usually contrived) clustering patterns, such as where clusters are made up of points arranged in lines to form shapes.

The function must be provided sufficient nearest-neighbor data for whatever is specified for \(k\). If \(k\) = 5, for example, the edge matrix (and neighbor matrix, if specified) must contain data on at least 5 neighbors for each point. This should not be problematic in typical use in connection with largeVis, which is ordinarily run with a far higher \(k\)-value than hdbscan.

References

R. Campello, D. Moulavi, and J. Sander, Density-Based Clustering Based on Hierarchical Density Estimates In: Advances in Knowledge Discovery and Data Mining, Springer, pp 160-172. 2013

Examples

Run this code

# NOT RUN {
library(largeVis)
library(clusteringdatasets)  # See https://github.com/elbamos/clusteringdatasets
data(spiral)
dat <- as.matrix(spiral[, 1:2])
neighbors <- randomProjectionTreeSearch(t(dat), K = 10, tree_threshold = 100,
                                       max_iter = 5, threads = 1)
edges <- buildEdgeMatrix(t(dat), neighbors)
clusters <- hdbscan(edges, neighbors = neighbors, verbose = FALSE, threads = 1)

# Calling largeVis while setting sgd_batches to 1 is
# the simplest way to generate the data structures neeeded for hdbscan
spiralVis <- largeVis(t(dat), K = 10, tree_threshold = 100, max_iter = 5,
                      sgd_batches = 1, threads = 1)
clusters <- hdbscan(spiralVis, verbose = FALSE, threads = 1)
# The gplot function helps to visualize the clustering
largeHighDimensionalDataset <- matrix(rnorm(50000), ncol = 50)
vis <- largeVis(largeHighDimensionalDataset)
clustering <- hdbscan(vis)
gplot(clustering, t(vis$coords))
# }

Run the code above in your browser using DataLab