pgmm (version 1.1)

pgmmEM: Model-Based Clustering & Classification Using PGMMs

Description

Carries out model-based clustering or classification using parsimonious Gaussian mixture models. AECM algorithms are used for parameter estimation. The BIC or the ICL is used for model selection.

Usage

pgmmEM(x,rG=1:2,rq=1:2,class=NULL,icl=FALSE,zstart=2,cccStart=TRUE,loop=3,zlist=NULL,
		modelSubset=NULL,seed=123456,tol=0.1,relax=FALSE)

Arguments

x
A matrix or data frame such that rows correspond to observations and columns correspond to variables.
rG
The range of values for the number of components.
rq
The range of values for the number of factors.
class
If NULL then model-based clustering is performed. If a vector with length equal to the number of observations, then model-based classification is performed. In this latter case, the ith entry of class is either zero, indicating that the observation has no known group membership, or some number in {1, ..., max(class)} indicating its known group membership.
icl
If TRUE then the ICL is used for model selection. Otherwise, the BIC is used for model selection.
zstart
A number that controls what starting values are used: (1) Random; (2) k-means; or (3) user-specified via zlist. A sketch illustrating all three options follows this argument list.
cccStart
If TRUE then random starting values are put through the CCC model and the resulting group memberships are used as starting values for the models specified in modelSubset. Only relevant for zstart=1. See Examples.
loop
A number specifying how many different random starts should be used. Only relevant for zstart=1.
zlist
A list comprising vectors of initial classifications such that zlist[[g]] gives the g-component starting values. Only relevant for zstart=3. See Examples.
modelSubset
A vector of strings giving the models to be used.
seed
A number giving the pseudo-random number seed to be used.
tol
A number specifying the epsilon value for the convergence criteria used in the AECM algorithms. For each algorithm, the criterion is based on the difference between the log-likelihood at an iteration and an asymptotic estimate of the log-likelihood at that iteration; see McNicholas et al. (2010) for details.
relax
By default, the number of factors cannot exceed half the number of variables. Setting relax=TRUE relaxes this constraint and allows the number of factors to take any value up to three-quarters of the number of variables.
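
The starting-value arguments interact, so the following sketch shows one call per zstart option. It is illustrative only: it assumes the wine data shipped with pgmm, and the hierarchical-clustering starts passed through zlist are an arbitrary choice.

 library(pgmm)
 data("wine")
 x <- scale(wine[, -1])

 # zstart=1: 'loop' random starts; cccStart=TRUE (the default)
 # first filters each random start through the CCC model
 fit1 <- pgmmEM(x, rG = 1:3, rq = 1:2, zstart = 1, loop = 5)

 # zstart=2 (the default): k-means starting values
 fit2 <- pgmmEM(x, rG = 1:3, rq = 1:2, zstart = 2)

 # zstart=3: user-specified starts; zlist[[g]] must give a
 # g-group classification of the observations for each g in rG
 hcl <- hclust(dist(x))
 z <- lapply(1:3, function(g) cutree(hcl, k = g))
 fit3 <- pgmmEM(x, rG = 1:3, rq = 1:2, zstart = 3, zlist = z)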

Value

An object of class pgmm is a list with components:

map
A vector of integers, taking values in {1, ..., G}, indicating the maximum a posteriori classifications for the best model.
model
A string giving the name of the best model.
g
The number of components for the best model.
q
The number of factors for the best model.
zhat
A matrix giving the raw values upon which map is based.
load
The factor loadings matrix (Lambda) for the best model.
noisev
The Psi matrix for the best model.
plot_info
A list that stores information to enable plot.
summ_info
A list that stores information to enable summary.

In addition, the object will contain one of the following, depending on the value of icl:

bic
A number giving the BIC for each model.
icl
A number giving the ICL for each model.
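
As a quick orientation, here is a minimal sketch of how these components might be inspected after a fit. It assumes the wine data and default settings; the particular best model selected may vary.

 library(pgmm)
 data("wine")
 fit <- pgmmEM(scale(wine[, -1]), rG = 1:3, rq = 1:2)
 fit$model          # name of the best model, e.g. "CUU"
 c(fit$g, fit$q)    # components and factors for the best model
 table(fit$map)     # maximum a posteriori classifications
 head(fit$zhat)     # raw values upon which map is based
 str(fit$load)      # factor loadings (Lambda) for the best model
 fit$bic            # present because icl=FALSE (the default)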

Details

The data x are either clustered using the PGMM approach of McNicholas & Murphy (2005, 2008, 2010) or classified using the method described by McNicholas (2010). In either case, all 12 covariance structures given by McNicholas & Murphy (2010) are available. Parameter estimation is carried out using AECM algorithms, as described in McNicholas et al. (2010). Either the BIC or the ICL is used for model selection. The number of AECM algorithms to be run depends on the range of values for the number of components rG, the range of values for the number of factors rq, and the number of models in modelSubset. Starting values are very important to the successful operation of these algorithms, so care must be taken in the interpretation of results.
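On this reading, the size of the model search is simply the product of the grid dimensions. A back-of-the-envelope sketch, with an illustrative grid (note that with zstart=1 each combination is additionally restarted loop times):

 rG <- 2:4
 rq <- 1:3
 modelSubset <- c("CUC", "CUU", "UCU")
 # one AECM algorithm per (model, G, q) combination
 length(modelSubset) * length(rG) * length(rq)   # 3 * 3 * 3 = 27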

References

Paul D. McNicholas and T. Brendan Murphy (2010). Model-based clustering of microarray expression data via latent Gaussian mixture models. Bioinformatics 26(21), 2705-2712.

Paul D. McNicholas (2010). Model-based classification using latent Gaussian mixture models. Journal of Statistical Planning and Inference 140(5), 1175-1181.

Paul D. McNicholas, T. Brendan Murphy, Aaron F. McDaid and Dermot Frost (2010). Serial and parallel implementations of model-based clustering via parsimonious Gaussian mixture models. Computational Statistics and Data Analysis 54(3), 711-723.

Paul D. McNicholas and T. Brendan Murphy (2008). Parsimonious Gaussian mixture models. Statistics and Computing 18(3), 285-296.

Paul D. McNicholas and T. Brendan Murphy (2005). Parsimonious Gaussian mixture models. Technical Report 05/11, Department of Statistics, Trinity College Dublin.

Examples

# Wine clustering example with three random starts and the CUU model.
 data("wine")
 x<-wine[,-1]
 x<-scale(x)
 wine_clust<-pgmmEM(x,rG=1:4,rq=1:4,zstart=1,loop=3,modelSubset=c("CUU"))
 table(wine[,1],wine_clust$map)

# Wine clustering example with custom starts and the CUU model.
 data("wine")
 x<-wine[,-1]
 x<-scale(x)
 hcl<-hclust(dist(x)) 
 z<-list()
 for(g in 1:4){ 
	 z[[g]]<-cutree(hcl,k=g)
 } 
 wine_clust2<-pgmmEM(x,1:4,1:4,zstart=3,modelSubset=c("CUU"),zlist=z)
 table(wine[,1],wine_clust2$map)
 print(wine_clust2)
 summary(wine_clust2)

# Olive oil classification by region (there are three regions), with two-thirds of
# the observations taken as having known group memberships, using the CUC, CUU and
# CUCU models.
 data("olive") 
 x<-olive[,-c(1,2)] 
 x<-scale(x) 
 cls<-olive[,1]
 for(i in 1:dim(olive)[1]){
	 if(i%%3==0){cls[i]<-0}
 }
 olive_class<-pgmmEM(x,rG=3:3,rq=4:6,cls,modelSubset=c("CUC","CUU", 
  "CUCU"),relax=TRUE)
 cls_ind<-(cls==0) 
 table(olive[cls_ind,1],olive_class$map[cls_ind])

# Another olive oil classification by region, but this time suppose we only know
# two-thirds of the labels for the first two regions but we suspect that there might
# be a third or even a fourth region.
 data("olive") 
 x<-olive[,-c(1,2)] 
 x<-scale(x) 
 cls2<-olive[,1]
 for(i in 1:dim(olive)[1]){
   if(i%%3==0||i>420){cls2[i]<-0}
 }
 olive_class2<-pgmmEM(x,2:4,4:6,cls2,modelSubset=c("CUU"),relax=TRUE)
 cls_ind2<-(cls2==0) 
 table(olive[cls_ind2,1],olive_class2$map[cls_ind2])
 
# Coffee clustering example using k-means starting values for all 12
# models with the ICL being used for model selection instead of the BIC.
 data("coffee")
 x<-coffee[,-c(1,2)]
 x<-scale(x)
 coffee_clust<-pgmmEM(x,rG=2:3,rq=1:3,zstart=2,icl=TRUE)
 table(coffee[,1],coffee_clust$map)
 plot(coffee_clust)
 plot(coffee_clust,onlyAll=TRUE)
  
# Coffee clustering example using k-means starting values for the UUU model, i.e., the
# mixture of factor analyzers model, for G=2 and q=1.
 data("coffee")
 x<-coffee[,-c(1,2)]
 x<-scale(x)
 coffee_clust_mfa<-pgmmEM(x,2:2,1:1,zstart=2,modelSubset=c("UUU"))
 table(coffee[,1],coffee_clust_mfa$map)
