MclustDRsubsel: Subset selection for GMMDR directions based on BIC.

Description

Implements a subset selection method that identifies the relevant directions spanning the dimension reduction subspace used to visualize the clustering or classification structure obtained from a finite mixture of Gaussian densities.

Usage

MclustDRsubsel(object, G = 1:9,
               modelNames = mclust.options("emModelNames"),
               ...,
               bic.stop = 0, bic.cutoff = 0,
               mindir = 1,
               verbose = interactive())

Arguments

object

An object of class 'MclustDR' resulting from a call to MclustDR.

G

An integer vector specifying the numbers of mixture components or clusters.

modelNames

A vector of character strings indicating the models to be fitted. See mclustModelNames for a description of the available models.

...

Further arguments passed on to Mclust or MclustDA.

bic.stop

A criterion to terminate the search. If the maximal BIC difference is less than bic.stop, the algorithm stops. Two typical values are:

: 0: algorithm stops when the BIC difference becomes negative (default)
: -Inf: algorithm continues until all directions have been selected
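
As an illustration of the two values above, the following sketch compares the default stopping rule with bic.stop = -Inf. It assumes the mclust package is installed and uses the built-in iris data, which is not one of this page's own examples; numdir is the component inherited from 'MclustDR' objects that stores the number of directions.

```r
library(mclust)

# Fit a Gaussian mixture and estimate the GMMDR directions
mod <- Mclust(iris[, 1:4], verbose = FALSE)
dr  <- MclustDR(mod)

# Default: stop as soon as the best BIC difference becomes negative
drs_default <- MclustDRsubsel(dr, bic.stop = 0, verbose = FALSE)

# bic.stop = -Inf: the search is never stopped early
drs_all <- MclustDRsubsel(dr, bic.stop = -Inf, verbose = FALSE)

# Compare the number of retained directions
c(default = drs_default$numdir, all = drs_all$numdir)
```

The counts depend on the data and on the fitted model, so no particular output is implied here.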

bic.cutoff

A value specifying how to select the simplest "best" model within bic.cutoff of the maximum value achieved. Setting this to 0 (default) simply selects the model with the largest BIC difference.

mindir

An integer value specifying the minimum number of directions to be estimated.

verbose

A logical or integer value specifying if and how much detailed information should be reported during the iterations of the algorithm. Possible values are:

: 0 or FALSE: no trace information is shown;
: 1 or TRUE: trace information is shown at each step of the search;
: 2: more detailed trace information is shown.

Value

An object of class 'MclustDRsubsel' which inherits from 'MclustDR', so it has the same components as the latter plus the following:

basisx

The basis of the estimated dimension reduction subspace expressed in terms of the original variables.

std.basisx

The basis of the estimated dimension reduction subspace expressed in terms of the original variables standardized to have unit standard deviation.
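
A short sketch of how the returned object might be inspected, assuming mclust is installed and again using the built-in iris data purely for illustration. Only the components named on this page are accessed.

```r
library(mclust)

mod <- Mclust(iris[, 1:4], verbose = FALSE)
drs <- MclustDRsubsel(MclustDR(mod), verbose = FALSE)

# The class vector shows the inheritance from 'MclustDR'
class(drs)

# Basis of the selected subspace in terms of the original variables
drs$basisx

# Same basis, for variables standardized to unit standard deviation
drs$std.basisx
```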

Details

The GMMDR method aims at reducing the dimensionality by identifying a set of linear combinations, ordered by importance as quantified by the associated eigenvalues, of the original features which capture most of the clustering or classification structure contained in the data. This is implemented in MclustDR.

The MclustDRsubsel function implements the greedy forward search algorithm discussed in Scrucca (2010) to prune the set of all GMMDR directions. The criterion used to select the relevant directions is based on the BIC difference between a clustering model and a model in which the proposed feature has no clustering relevance. The steps are the following:

1. Select the first feature to be the one which maximizes the BIC difference between the best clustering model and the model which assumes no clustering, i.e. a single component.

2. Select the next feature, from amongst those not previously included, to be the one which maximizes the BIC difference.

3. Iterate the previous step until all the BIC differences for the inclusion of a feature become less than bic.stop.

At each step, the search over the model space is performed with respect to the model parametrisation and the number of clusters.
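
The steps above can be sketched in a few lines of R. This is a simplified illustration, not the code used by MclustDRsubsel: the helpers bic_reg, bic_clust and greedy_directions are hypothetical names introduced here, and the "no clustering relevance" model is assumed to be a linear regression of the proposed direction on the directions already selected. The sketch also searches only over the directions stored in the dir component of an 'MclustDR' object and ignores bic.cutoff and mindir.

```r
library(mclust)

# Hypothetical helper (not part of mclust): BIC of a linear regression of y
# on the columns of X, on mclust's scale (2 * logLik - npar * log(n)).
bic_reg <- function(y, X = NULL) {
  fit <- if (is.null(X)) lm(y ~ 1) else lm(y ~ X)
  -BIC(fit)
}

# Hypothetical helper: BIC of the best mixture model on the given columns,
# searching over the default parametrisations and the supplied values of G.
bic_clust <- function(Z, cols, G) {
  Mclust(Z[, cols, drop = FALSE], G = G, verbose = FALSE)$bic
}

greedy_directions <- function(Z, G = 1:9, bic.stop = 0) {
  selected   <- integer(0)
  candidates <- seq_len(ncol(Z))
  while (length(candidates) > 0) {
    # Directions already selected, and the BIC of their clustering model
    X    <- if (length(selected) > 0) Z[, selected, drop = FALSE] else NULL
    base <- if (length(selected) > 0) bic_clust(Z, selected, G) else 0
    # BIC difference for adding each remaining candidate direction
    diffs <- sapply(candidates, function(j) {
      bic_clust(Z, c(selected, j), G) - (base + bic_reg(Z[, j], X))
    })
    # Stop when no candidate reaches the threshold
    if (max(diffs) < bic.stop) break
    best       <- candidates[which.max(diffs)]
    selected   <- c(selected, best)
    candidates <- setdiff(candidates, best)
  }
  selected
}

# Usage on the GMMDR directions of a fitted model (iris is illustrative only)
mod <- Mclust(iris[, 1:4], verbose = FALSE)
dr  <- MclustDR(mod)
greedy_directions(dr$dir, G = 1:5)
```

In the first iteration no direction has been selected yet, so the comparison reduces to the best clustering model against a single Gaussian component, as in step 1.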

References

Scrucca, L. (2010) Dimension reduction for model-based clustering. Statistics and Computing, 20(4), pp. 471-484.

Scrucca, L. (2014) Graphical tools for model-based mixture discriminant analysis. Advances in Data Analysis and Classification, 8(2), pp. 147-165.

Examples


# NOT RUN {
library(mclust)

# clustering
data(crabs, package = "MASS")
x <- crabs[,4:8]
class <- paste(crabs$sp, crabs$sex, sep = "|")
mod <- Mclust(x)
table(class, mod$classification)
dr <- MclustDR(mod)
summary(dr)
plot(dr)
drs <- MclustDRsubsel(dr)
summary(drs)
table(class, drs$class)
plot(drs, what = "scatterplot")
plot(drs, what = "pairs")
plot(drs, what = "contour")
plot(drs, what = "boundaries")
plot(drs, what = "evalues")

# classification
data(banknote)
da <- MclustDA(banknote[,2:7], banknote$Status)
table(banknote$Status, predict(da)$class)
dr <- MclustDR(da)
summary(dr)
drs <- MclustDRsubsel(dr)
summary(drs)
table(banknote$Status, predict(drs)$class)
plot(drs, what = "scatterplot")
plot(drs, what = "classification")
plot(drs, what = "boundaries")
# }
