DR: Downhill Riding (DR) Procedure

Description

Downhill riding procedure for selecting optimal tuning parameters in clustering algorithms, using an (in)stability probe.

Usage

DR(X, method, minPts = 3, theta = 0.9, B = 500, lb = -30, ub = 10)

Value

A list containing the following components:

P_opt: the value of the optimal parameter. If the method is DBSCAN, then P_opt is optimal \(\epsilon\). If the method is TRUST, then P_opt is optimal \(\delta\).
ACD_matrix: a matrix containing the Average Cluster Deviation (ACD) for different values of the tuning parameter. If the method is DBSCAN, then the tuning parameter is \(\epsilon\). If the method is TRUST, then the tuning parameter is \(\delta\).

Arguments

X: an \(n\times k\) matrix where columns are \(k\) objects to be clustered, and each object contains \(n\) observations (objects could be a set of time series).
method: the clustering method to be used -- currently either "TRUST" (Ciampi et al. 2010) or "DBSCAN" (Ester et al. 1996). If the method is DBSCAN, then set minPts, and the optimal \(\epsilon\) is selected using DR. If the method is TRUST, then set theta, and the optimal \(\delta\) is selected using DR.
minPts: the minimum number of samples in an \(\epsilon\)-neighborhood of a point for it to be considered a core point. minPts is used only with the DBSCAN method. The default value is 3.
theta: connectivity parameter \(\theta \in (0,1)\), which is to be used only with the TRUST method. The default value is 0.9.
B: number of random splits in calculating the Average Cluster Deviation (ACD). The default value is 500.
lb, ub: endpoints of the search range for the optimal parameter, given on the exponent scale (see Details). The default values are -30 and 10.
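
As a quick illustration of the layout expected for X, the sketch below arranges the built-in iris measurements so that each flower is one column. The variable names are illustrative only and are not part of the package.

```r
# Illustrative only: arrange data so that the objects to be clustered are columns.
data(iris)
features <- scale(iris[, -5])  # 150 flowers (rows) x 4 measurements (columns)
X <- t(features)               # 4 observations (rows) x 150 objects (columns)
dim(X)                         # n = 4 observations, k = 150 objects
```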

Author

Xin Huang, Yulia R. Gel

Details

Parameters lb and ub are the endpoints of the search for the optimal parameter. The parameter candidates are calculated as \(P := 1.1^x,\ x \in \{lb, lb+0.5, lb+1.0, \ldots, ub\}\). Although the default search range is sufficiently wide, in some cases lb and ub can be extended further if a warning message is given.
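
The candidate grid described above can be reproduced in a few lines of base R. This is only a sketch of the stated formula, not the package's internal code, and the variable names x and P are chosen here for illustration.

```r
# Sketch of the candidate grid from the formula above (not the package internals).
lb <- -30
ub <- 10
x <- seq(lb, ub, by = 0.5)   # exponents lb, lb + 0.5, ..., ub
P <- 1.1^x                   # candidate values of the tuning parameter
length(P)                    # number of candidates
range(P)                     # smallest and largest candidate
```

With the default endpoints this gives 81 candidates between roughly 0.057 and 2.59. Because the grid is geometric, lowering lb or raising ub widens the range multiplicatively.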

For more discussion of the properties of the considered clustering algorithms and the DR procedure, see Huang et al. (2016) and Huang et al. (2018).

References

Examples


if (FALSE) {
## example 1
## use iris data to test DR procedure

data(iris)
require(funtimes)  # provides DR
require(dbscan)    # provides dbscan
require(clue)  # calculate NMI to compare the clustering result with the ground truth
require(scatterplot3d)

Data <- scale(iris[,-5])
ground_truth_label <- iris[,5]

# perform DR procedure to select optimal eps for DBSCAN 
# and save it in variable eps_opt
eps_opt <- DR(t(Data), method="DBSCAN", minPts = 5)$P_opt   

# apply DBSCAN with the optimal eps on iris data 
# and save the clustering result in variable res
res <- dbscan(Data, eps = eps_opt, minPts =5)$cluster  

# calculate NMI to compare the clustering result with the ground truth label
clue::cl_agreement(as.cl_partition(ground_truth_label),
                   as.cl_partition(as.numeric(res)), method = "NMI") 
# visualize the clustering result and compare it with the ground truth result
# 3D visualization of clustering result using variables Sepal.Length, Sepal.Width,
# and Petal.Length (res + 1 so that DBSCAN noise points, labeled 0, get a visible color)
scatterplot3d(Data[,-4], color = res + 1)
# 3D visualization of ground truth result using variables Sepal.Length, Sepal.Width,
# and Petal.Length
scatterplot3d(Data[,-4], color = as.numeric(ground_truth_label))


## example 2
## use synthetic time series data to test DR procedure

require(funtimes)
require(clue) 
require(zoo)

# simulate 12 time series for 3 clusters, each cluster contains 4 time series
set.seed(114) 
samp_Ind <- sample(12, replace = FALSE)
time_points <- 30
X <- matrix(0,nrow=time_points,ncol = 12)
cluster1 <- sapply(1:4,function(x) arima.sim(list(order = c(1, 0, 0), ar = c(0.2)),
                                             n = time_points, mean = 0, sd = 1))
cluster2 <- sapply(1:4,function(x) arima.sim(list(order = c(2 ,0, 0), ar = c(0.1, -0.2)),
                                             n = time_points, mean = 2, sd = 1))
cluster3 <- sapply(1:4,function(x) arima.sim(list(order = c(1, 0, 1), ar = c(0.3), ma = c(0.1)),
                                             n = time_points, mean = 6, sd = 1))

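# each simulated cluster is already a time_points x 4 matrix (one series per column),
# so it is assigned directly without transposing
X[,samp_Ind[1:4]] <- round(cluster1, 4)
X[,samp_Ind[5:8]] <- round(cluster2, 4)
X[,samp_Ind[9:12]] <- round(cluster3, 4)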


# create ground truth label of the synthetic data
ground_truth_label <- matrix(1, nrow = 12, ncol = 1)
for (k in 1:3) {
    ground_truth_label[samp_Ind[(4*k - 4 + 1):(4*k)]] <- k
}

# perform DR procedure to select optimal delta for TRUST
# and save it in variable delta_opt
delta_opt <- DR(X, method = "TRUST")$P_opt 

# apply TRUST with the optimal delta on the synthetic data 
# and save the clustering result in variable res
res <- CSlideCluster(X, Delta = delta_opt, Theta = 0.9)  

# calculate NMI to compare the clustering result with the ground truth label
clue::cl_agreement(as.cl_partition(as.numeric(ground_truth_label)),
                   as.cl_partition(as.numeric(res)), method = "NMI")

# visualize the clustering result and compare it with the ground truth result
# visualization of the clustering result obtained by TRUST
plot.zoo(X, type = "l", plot.type = "single", col = res, xlab = "Time index", ylab = "")
# visualization of the ground truth result 
plot.zoo(X, type = "l", plot.type = "single", col = ground_truth_label,
         xlab = "Time index", ylab = "")
}
