kmeansvar: k-means clustering of variables

Description

Iterative relocation algorithm of k-means type that partitions a set of variables. Variables can be quantitative, qualitative or a mixture of both. The center of a cluster of variables is a synthetic variable, not a 'mean' as in classical k-means: it is the first principal component calculated by PCAmix. PCAmix is defined for a mixture of qualitative and quantitative variables and includes ordinary principal component analysis (PCA) and multiple correspondence analysis (MCA) as special cases. The homogeneity of a cluster of variables is defined as the sum of the correlation ratios (for qualitative variables) and the squared correlations (for quantitative variables) between the variables and the center of the cluster, which is in all cases a numerical variable. Missing values are replaced by means for quantitative variables and by zeros in the indicator matrix for qualitative variables.
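
For the purely quantitative case, the center and the homogeneity of a single cluster can be illustrated in base R. This is a minimal sketch, not the package's implementation: it assumes the built-in mtcars data and a hypothetical cluster of four variables, and it uses prcomp as a stand-in for PCAmix, so the sign and scaling of the center may differ from what kmeansvar returns.

```r
# Hypothetical cluster of four quantitative variables (mtcars is an assumption of this sketch).
X <- scale(mtcars[, c("mpg", "disp", "hp", "wt")])

# Stand-in for the cluster center: first principal component of the standardized variables.
center <- prcomp(X)$x[, 1]

# Squared correlation of each variable with the center (the squared loadings).
sq_loadings <- drop(cor(X, center))^2

# Homogeneity of the cluster: the sum of the squared correlations.
homogeneity <- sum(sq_loadings)

# For standardized variables this equals the largest eigenvalue of the correlation matrix.
lambda1 <- eigen(cor(X))$values[1]
print(round(c(homogeneity = homogeneity, lambda1 = lambda1), 4))
```

The equality with the largest eigenvalue is why the first principal component is a natural center: among all linear combinations of the variables, it maximizes this sum of squared correlations.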

Usage

kmeansvar(X.quanti = NULL, X.quali = NULL, init, iter.max = 150,
  nstart = 1, matsim = FALSE)

Arguments

X.quanti

a numeric matrix of data, or an object that can be coerced to such a matrix (such as a numeric vector or a data frame with all numeric columns).

X.quali

a categorical matrix of data, or an object that can be coerced to such a matrix (such as a character vector, a factor or a data frame with all factor columns).

init

either the number of clusters or an initial partition (a vector of integers indicating the cluster to which each variable is allocated). If init is a number, a random set of (distinct) columns in X.quali and X.quanti is chosen as the initial cluster centers.

iter.max

the maximum number of iterations allowed.

nstart

if init is a number, nstart is the number of random sets of initial centers that are tried.

matsim

logical; if TRUE, the matrices of similarities between variables in the same cluster are calculated.

Value

var

a list of matrices of squared loadings, i.e. for each cluster of variables, the squared loadings on the first principal component of PCAmix. For quantitative variables (resp. qualitative), squared loadings are the squared correlations (resp. the correlation ratios) with the first PC (the cluster center).

sim

a list of matrices of similarities, i.e. for each cluster, the similarities between its variables. The similarity between two variables is defined as a squared cosine: the square of the Pearson correlation when the two variables are quantitative; the correlation ratio when one variable is quantitative and the other one is qualitative; the square of the canonical correlation between two sets of dummy variables when the two variables are qualitative. sim is NULL if matsim is FALSE.
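
The three similarity cases can be sketched in base R. The snippet below is an illustration under stated assumptions, not the code used by kmeansvar: the variables come from the built-in mtcars data, correlation_ratio is a hypothetical helper, and the qualitative case is assumed to use the first (largest) canonical correlation.

```r
# Illustrative variables from mtcars (an assumption of this sketch).
x1 <- mtcars$mpg            # quantitative
x2 <- mtcars$wt             # quantitative
f1 <- factor(mtcars$cyl)    # qualitative
f2 <- factor(mtcars$gear)   # qualitative

# Two quantitative variables: squared Pearson correlation.
sim_quanti <- cor(x1, x2)^2

# One quantitative, one qualitative: correlation ratio,
# i.e. between-group sum of squares over total sum of squares.
correlation_ratio <- function(x, f) {
  group_means <- tapply(x, f, mean)
  group_sizes <- tapply(x, f, length)
  sum(group_sizes * (group_means - mean(x))^2) / sum((x - mean(x))^2)
}
sim_mixed <- correlation_ratio(x1, f1)

# Two qualitative variables: squared first canonical correlation between the dummy sets
# (one dummy column per factor is dropped to avoid collinearity).
d1 <- model.matrix(~ f1)[, -1, drop = FALSE]
d2 <- model.matrix(~ f2)[, -1, drop = FALSE]
sim_quali <- cancor(d1, d2)$cor[1]^2

print(c(quanti = sim_quanti, mixed = sim_mixed, quali = sim_quali))
```

The correlation ratio computed this way coincides with the R-squared of a one-way analysis of variance of the quantitative variable on the qualitative one, which is a convenient cross-check.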

cluster

a vector of integers indicating the cluster to which each variable is allocated.

wss

the within-cluster sum of squares for each cluster: the sum of the correlation ratio (for qualitative variables) and the squared correlation (for quantitative variables) between the variables and the center of the cluster.

the percentage of homogeneity which is accounted for by the partition in k clusters.

size

the number of variables in each cluster.

scores

an n by k numerical matrix which contains the k cluster centers. The center of a cluster is a synthetic variable: the first principal component calculated by PCAmix. The k columns of scores contain the scores of the n observation units on the first PCs of the k clusters.

coef

a list of the coefficients of the linear combinations defining the synthetic variable of each cluster.

Details

If the quantitative and qualitative data are in the same data frame, the function splitmix can be used to automatically extract the qualitative and the quantitative data into two separate data frames.
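
The kind of split involved can be sketched in base R. This is a stand-in for illustration, not splitmix itself: the mixed data frame is built from mtcars as an assumption, and the call to kmeansvar runs only if ClustOfVar is installed.

```r
# Small mixed data frame built from mtcars (an illustrative assumption).
mixed <- data.frame(
  mpg  = mtcars$mpg,
  wt   = mtcars$wt,
  hp   = mtcars$hp,
  cyl  = factor(mtcars$cyl),
  gear = factor(mtcars$gear)
)

# Base-R version of the split: numeric columns on one side, the rest on the other.
is_quanti <- vapply(mixed, is.numeric, logical(1))
X.quanti <- mixed[, is_quanti, drop = FALSE]
X.quali  <- mixed[, !is_quanti, drop = FALSE]

# If ClustOfVar is installed, the two parts can be passed to kmeansvar directly.
if (requireNamespace("ClustOfVar", quietly = TRUE)) {
  part <- ClustOfVar::kmeansvar(X.quanti = X.quanti, X.quali = X.quali, init = 2)
  print(part$cluster)
}
```

Because init is a number here, the initial centers are drawn at random, so the resulting partition can change from run to run unless a seed is set.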

References

Chavent, M., Liquet, B., Kuentz, V., Saracco, J. (2012), ClustOfVar: An R Package for the Clustering of Variables. Journal of Statistical Software, Vol. 50, pp. 1-16.

Examples

# NOT RUN {
library(ClustOfVar)
data(decathlon)

# choice of the number of clusters
tree <- hclustvar(X.quanti = decathlon[, 1:10])
stab <- stability(tree, B = 60)

# a random set of variables is chosen as the initial cluster centers, nstart = 10 times
part1 <- kmeansvar(X.quanti = decathlon[, 1:10], init = 5, nstart = 10)
summary(part1)

# the partition from the hierarchical clustering is chosen as the initial partition
part_init <- cutreevar(tree, 5)$cluster
part2 <- kmeansvar(X.quanti = decathlon[, 1:10], init = part_init, matsim = TRUE)
summary(part2)
part2$sim  # similarity matrices, one per cluster (available because matsim = TRUE)

# }
