Calculating the preferred validation index for a k-Prototypes clustering with k clusters or computing the optimal number of clusters based on the choosen index for k-Prototype clustering. Possible validation indices are: cindex
, dunn
, gamma
, gplus
, mcclain
, ptbiserial
, silhouette
and tau
.
validation_kproto(
method = "silhouette",
object = NULL,
data = NULL,
type = "huang",
k = NULL,
lambda = NULL,
kp_obj = "optimal",
verbose = FALSE,
...
)
For computing the optimal number of clusters based on the choosen validation index for k-Prototype clustering the output contains:
optimal number of clusters (sampled in case of ambiguity)
index value of the index optimal clustering
calculated indices for \(k=2,...,k_{max}\)
if(kp_obj == "optimal") the kproto object of the index optimal clustering and if(kp_obj == "all") all kproto which were calculated
For computing the index-value for a given k-Prototype clustering the output contains:
calculated index-value
Character specifying the validation index: cindex
, dunn
, gamma
, gplus
, mcclain
, ptbiserial
, silhouette
(default) or tau
.
Object of class kproto
resulting from a call with kproto(..., keep.data=TRUE)
.
Original data; only required if object == NULL
and neglected if object != NULL
.
Character, to specify the distance for clustering; either "huang"
or "gower"
.
Vector specifying the search range for optimum number of clusters; if NULL
the range will set as 2:sqrt(n)
. Only required if object == NULL
and neglected if object != NULL
.
Factor to trade off between Euclidean distance of numeric variables and simple matching coefficient between categorical variables.
character either "optimal" or "all": Output of the index-optimal clustering (kp_obj == "optimal") or all computed cluster partitions (kp_obj == "all"); only required if object != NULL
.
Logical, whether additional information about process should be printed.
Further arguments passed to kproto
, like:
nstart
: If > 1 repetitive computations of kproto
with random initializations are computed.
na.rm
: Character, either "yes"
to strip NA
values for complete case analysis, "no"
to keep and ignore NA
values, "imp.internal"
to impute the NAs
within the algorithm or "imp.onestep"
to apply the algorithm ignoring the NAs
and impute them after the partition is determined.
Rabea Aschenbruck
More information about the implemented validation indices:
cindex
$$Cindex = \frac{S_w-S_{min}}{S_{max}-S_{min}}$$
For \(S_{min}\) and \(S_{max}\) it is necessary to calculate the distances between all pairs of points in the entire data set (\(\frac{n(n-1)}{2}\)).
\(S_{min}\) is the sum of the "total number of pairs of objects belonging to the same cluster" smallest distances and
\(S_{max}\) is the sum of the "total number of pairs of objects belonging to the same cluster" largest distances. \(S_w\) is the sum of the within-cluster distances.
The minimum value of the index is used to indicate the optimal number of clusters.
dunn
$$Dunn = \frac{\min_{1 \leq i < j \leq q} d(C_i, C_j)}{\max_{1 \leq k \leq q} diam(C_k)}$$
The following applies: The dissimilarity between the two clusters \(C_i\) and \(C_j\) is defined as \(d(C_i, C_j)=\min_{x \in C_i, y \in C_j} d(x,y)\) and
the diameter of a cluster is defined as \(diam(C_k)=\max_{x,y \in C} d(x,y)\).
The maximum value of the index is used to indicate the optimal number of clusters.
gamma
$$Gamma = \frac{s(+)-s(-)}{s(+)+s(-)}$$
Comparisons are made between all within-cluster dissimilarities and all between-cluster dissimilarities.
\(s(+)\) is the number of concordant comparisons and \(s(-)\) is the number of discordant comparisons.
A comparison is named concordant (resp. discordant) if a within-cluster dissimilarity is strictly less (resp. strictly greater) than a between-cluster dissimilarity.
The maximum value of the index is used to indicate the optimal number of clusters.
gplus
$$Gplus = \frac{2 \cdot s(-)}{\frac{n(n-1)}{2} \cdot (\frac{n(n-1)}{2}-1)}$$
Comparisons are made between all within-cluster dissimilarities and all between-cluster dissimilarities.
\(s(-)\) is the number of discordant comparisons and a comparison is named discordant if a within-cluster
dissimilarity is strictly greater than a between-cluster dissimilarity.
The minimum value of the index is used to indicate the optimal number of clusters.
mcclain
$$McClain = \frac{\bar{S}_w}{\bar{S}_b}$$
\(\bar{S}_w\) is the sum of within-cluster distances divided by the number of within-cluster distances and
\(\bar{S}_b\) is the sum of between-cluster distances divided by the number of between-cluster distances.
The minimum value of the index is used to indicate the optimal number of clusters.
ptbiserial
$$Ptbiserial = \frac{(\bar{S}_b-\bar{S}_w) \cdot (\frac{N_w \cdot N_b}{N_t^2})^{0.5}}{s_d}$$
\(\bar{S}_w\) is the sum of within-cluster distances divided by the number of within-cluster distances and
\(\bar{S}_b\) is the sum of between-cluster distances divided by the number of between-cluster distances.
\(N_t\) is the total number of pairs of objects in the data, \(N_w\) is the total number of pairs of
objects belonging to the same cluster and \(N_b\) is the total number of pairs of objects belonging to different clusters.
\(s_d\) is the standard deviation of all distances.
The maximum value of the index is used to indicate the optimal number of clusters.
silhouette
$$Silhouette = \frac{1}{n} \sum_{i=1}^n \frac{b(i)-a(i)}{max(a(i),b(i))}$$
\(a(i)\) is the average dissimilarity of the ith object to all other objects of the same/own cluster.
\(b(i)=min(d(i,C))\), where \(d(i,C)\) is the average dissimilarity of the ith object to all the other clusters except the own/same cluster.
The maximum value of the index is used to indicate the optimal number of clusters.
tau
$$Tau = \frac{s(+) - s(-)}{((\frac{N_t(N_t-1)}{2}-t)\frac{N_t(N_t-1)}{2})^{0.5}}$$
Comparisons are made between all within-cluster dissimilarities and all between-cluster dissimilarities.
\(s(+)\) is the number of concordant comparisons and \(s(-)\) is the number of discordant comparisons.
A comparison is named concordant (resp. discordant) if a within-cluster dissimilarity is strictly less
(resp. strictly greater) than a between-cluster dissimilarity.
\(N_t\) is the total number of distances \(\frac{n(n-1)}{2}\) and \(t\) is the number of comparisons
of two pairs of objects where both pairs represent within-cluster comparisons or both pairs are between-cluster
comparisons.
The maximum value of the index is used to indicate the optimal number of clusters.
Aschenbruck, R., Szepannek, G. (2020): Cluster Validation for Mixed-Type Data. Archives of Data Science, Series A, Vol 6, Issue 1. tools:::Rd_expr_doi("10.5445/KSP/1000098011/02").
Charrad, M., Ghazzali, N., Boiteau, V., Niknafs, A. (2014): NbClust: An R Package for Determining the Relevant Number of Clusters in a Data Set. Journal of Statistical Software, Vol 61, Issue 6. tools:::Rd_expr_doi("10.18637/jss.v061.i06").
if (FALSE) {
# generate toy data with factors and numerics
n <- 10
prb <- 0.99
muk <- 2.5
x1 <- sample(c("A","B"), 2*n, replace = TRUE, prob = c(prb, 1-prb))
x1 <- c(x1, sample(c("A","B"), 2*n, replace = TRUE, prob = c(1-prb, prb)))
x1 <- as.factor(x1)
x2 <- sample(c("A","B"), 2*n, replace = TRUE, prob = c(prb, 1-prb))
x2 <- c(x2, sample(c("A","B"), 2*n, replace = TRUE, prob = c(1-prb, prb)))
x2 <- as.factor(x2)
x3 <- c(rnorm(n, mean = -muk), rnorm(n, mean = muk), rnorm(n, mean = -muk), rnorm(n, mean = muk))
x4 <- c(rnorm(n, mean = -muk), rnorm(n, mean = muk), rnorm(n, mean = -muk), rnorm(n, mean = muk))
x <- data.frame(x1,x2,x3,x4)
# calculate optimal number of cluster, index values and clusterpartition with Silhouette-index
val <- validation_kproto(method = "silhouette", data = x, k = 3:5, nstart = 5)
# apply k-prototypes
kpres <- kproto(x, 4, keep.data = TRUE)
# calculate cindex-value for the given clusterpartition
cindex_value <- validation_kproto(method = "cindex", object = kpres)
}
Run the code above in your browser using DataLab