akclustr: Anchored k-medoids clustering

Description

Given a list of trajectories and a functional method, this function clusters the trajectories into k groups. If k is given as a vector of two numbers (a minimum and a maximum), the function determines the best solution within that range using the quality criterion selected with crit (the average Silhouette width or the Caliński-Harabasz criterion).

Usage

akclustr(traj, id_field = FALSE, method = "linear",
  k = c(3, 6), crit = "Silhouette", verbose = TRUE, quality_plot = FALSE)

Arguments

traj

[matrix (numeric)]: longitudinal data. Each row represents an individual trajectory (of observations). The columns show the observations at consecutive time steps.

id_field

[numeric or character] Whether the first column of traj is a unique identifier (id) field. Default: FALSE. If TRUE, the function treats the second column as the first time point.

method

[character] The parametric initialization strategy. Currently, the only available method is the linear method, set as "linear". It fits a time-dependent linear regression line to each trajectory, and the resulting groups are ordered by increasing slope.

k

[integer or vector (numeric)] Either an exact integer number of clusters, or a vector of length two specifying the minimum and maximum numbers of clusters to be examined, from which the best solution will be determined. In either case, the minimum number of clusters is 3. The default is c(3,6).

crit

[character] The criterion used to assess the quality of the cluster solutions when k is a vector of two values (as above). Default: crit="Silhouette", which uses the average Silhouette width (Rousseeuw P. J. 1987). With the "Silhouette" criterion, the optimal value of k can be determined as the elbow point of the curve. The other valid criterion is "Calinski_Harabasz" (Caliński T. & Harabasz J. 1974), for which the maximum score marks the optimal k. Having determined the optimal k, the function can then be re-run using that exact value of k.

verbose

Whether to print output messages to the console during clustering. Set to FALSE to suppress them. Default: TRUE.

quality_plot

Whether to show a plot of the quality criterion across the different values of k. Default: FALSE.
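The Caliński-Harabasz criterion named under crit compares between-cluster and within-cluster dispersion, so a higher score indicates a better-separated partition. The following is a minimal base-R sketch of that score, written only for illustration. The helper name ch_index and the toy data are assumptions made here; this is not the code akclustr runs internally.

```r
# Hypothetical helper (not part of akmedoids): Calinski-Harabasz score for
# a numeric matrix `x` (one trajectory per row) and cluster labels `cl`.
ch_index <- function(x, cl) {
  x <- as.matrix(x)
  n <- nrow(x)
  groups <- sort(unique(cl))
  k <- length(groups)
  overall <- colMeans(x)          # grand mean trajectory
  between <- 0                    # between-cluster sum of squares
  within <- 0                     # within-cluster sum of squares
  for (g in groups) {
    rows <- x[cl == g, , drop = FALSE]
    centre <- colMeans(rows)
    between <- between + nrow(rows) * sum((centre - overall)^2)
    within <- within + sum(sweep(rows, 2, centre)^2)
  }
  (between / (k - 1)) / (within / (n - k))
}

# Toy data: six flat trajectories over four time steps, in three levels.
x <- rbind(
  c(1, 1, 1, 1), c(2, 2, 2, 2),
  c(10, 10, 10, 10), c(11, 11, 11, 11),
  c(20, 20, 20, 20), c(21, 21, 21, 21)
)
good <- c(1, 1, 2, 2, 3, 3)   # groups neighbouring levels together
bad  <- c(1, 2, 3, 1, 2, 3)   # mixes levels across groups
ch_good <- ch_index(x, good)
ch_bad  <- ch_index(x, bad)
ch_good > ch_bad              # the sensible partition scores higher
```

When k is a range, the same comparison is made across the candidate values of k rather than across two hand-made partitions.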

Value

An akobject consisting of the cluster solutions at the specified values of k, together with a graphical plot of the quality scores of the cluster solutions.

Details

This function works by first approximating the trajectories based on the chosen parametric form (e.g. linear), and then partitioning the original trajectories based on the resulting groupings, in a similar fashion to k-means clustering (Genolini et al. 2015). The key distinction of akmedoids compared with existing longitudinal approaches is that both the initial starting points and the subsequent cluster centers (as the iteration progresses) are based on the selection of observations (medoids) as opposed to centroids.
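The procedure described above can be illustrated with a simplified base-R sketch: fit a line to each trajectory, form initial groups in bands of increasing slope, then alternate between picking a medoid per group and reassigning trajectories to the nearest medoid. The function name anchored_kmedoids_sketch, the equal-sized slope bands, and the Euclidean distance are all assumptions made for this illustration; the package's actual implementation may differ in each of these choices.

```r
# Hypothetical illustration only; not the akmedoids implementation.
# Assumes the trajectories are distinct, so no group becomes empty.
anchored_kmedoids_sketch <- function(x, k, max_iter = 20) {
  x <- as.matrix(x)
  tt <- seq_len(ncol(x))
  # 1. Approximate each trajectory by a least-squares line; keep the slope.
  slopes <- unname(apply(x, 1, function(y) coef(lm(y ~ tt))[2]))
  # 2. Initial groups: k equal-sized bands of increasing slope.
  cl <- ceiling(rank(slopes, ties.method = "first") * k / nrow(x))
  d <- as.matrix(dist(x))   # pairwise Euclidean distances
  medoids <- integer(0)
  for (i in seq_len(max_iter)) {
    # 3. Medoid of a group: the member with the smallest total distance
    #    to the other members of that group.
    medoids <- sapply(seq_len(k), function(g) {
      members <- which(cl == g)
      members[which.min(rowSums(d[members, members, drop = FALSE]))]
    })
    # 4. Reassign every trajectory to its nearest medoid.
    new_cl <- unname(apply(d[, medoids, drop = FALSE], 1, which.min))
    if (all(new_cl == cl)) break
    cl <- new_cl
  }
  list(cluster = cl, medoids = medoids, slopes = slopes)
}

# Toy data: two falling, two flat and two rising trajectories.
x <- rbind(
  c(10, 8, 6, 4, 2),
  c(9, 7, 6, 4, 3),
  c(5, 5, 5, 5, 5),
  c(4, 5, 4, 5, 4),
  c(1, 3, 5, 7, 9),
  c(2, 4, 5, 7, 8)
)
fit <- anchored_kmedoids_sketch(x, k = 3)
fit$cluster   # group labels, numbered in order of increasing slope
fit$medoids   # row indices of the trajectories acting as cluster centers
```

Because each center is an observed trajectory rather than an average, the medoids can be read directly as representative members of their groups.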

References

1. Genolini, C. et al. (2015) kml and kml3d: R Packages to Cluster Longitudinal Data. Journal of Statistical Software, 65(4), 1-34. URL http://www.jstatsoft.org/v65/i04/.

2. Rousseeuw P. J. (1987) Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. J. Comput. Appl. Math. 20:53–65.

3. Caliński T, Harabasz J (1974) A dendrite method for cluster analysis. Commun. Stat. 3:1–27.

Examples

# NOT RUN {
data(traj)

trajectry <- data_imputation(traj, id_field = TRUE, method = 2,
replace_with = 1, fill_zeros = FALSE)

trajectry <- props(trajectry$CompleteData, id_field = TRUE)

print(trajectry)

output <- akclustr(trajectry, id_field = TRUE,
method = "linear", k = c(3,7), crit='Calinski_Harabasz',
verbose = FALSE, quality_plot=FALSE)

print(output)

# }