This function implements the Mixtures of Contaminated Gaussian Factor Analyzers (MCGFA) model for model-based clustering and classification. The approach is designed for high-dimensional, noisy data. A description of the model can be found in Punzo, Blostein and McNicholas (2017). Parameter estimation is performed using an Alternating Expectation-Conditional Maximization (AECM) algorithm (Meng and Van Dyk, 1997).
Besides clustering observations into components, this algorithm also automatically detects outliers through the use of the contaminated Gaussian distribution. In the end, each observation is classified in a nested fashion: first into components, then as an inlier or outlier within its component.
To specify an individual MCGFA model, one must determine: the number of components (\(G\)), the number of latent factors (\(q\)), and the model name. The model name is one of the following thirty-two options:
CCCCC, CCUCC, CUCCC, CUUCC, UCCCC, UCUCC, UUCCC, UUUCC
CCCCU, CCUCU, CUCCU, CUUCU, UCCCU, UCUCU, UUCCU, UUUCU
CCCUC, CCUUC, CUCUC, CUUUC, UCCUC, UCUUC, UUCUC, UUUUC
CCCUU, CCUUU, CUCUU, CUUUU, UCCUU, UCUUU, UUCUU, UUUUU
An explanation of the model naming scheme can be found in the next subsection, along with instructions on how to conveniently generate a subset of the full set of models.
The user may choose to fit a single model. Usually, however, many models are fit to the same data, and a model selection criterion is used to determine the best one. By providing multiple values for the rG, rq and models parameters, many models can be fit at once. The mcgfa function then selects the best model according to the Bayesian Information Criterion (BIC).
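For instance, a call along these lines fits all combinations of one to three components, one or two factors, and two model names at once (a sketch; the data matrix x is a placeholder):

    # 3 values of G x 2 values of q x 2 model names = 12 models fit at once;
    # the model with the best BIC is selected. x: numeric data matrix,
    # with one row per observation.
    fit <- mcgfa(x, rG = 1:3, rq = 1:2, models = c("CCCCC", "UUUUU"))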
When fitting many models to large data, parallel computation may be employed by mcgfa via the parallel parameter. This parallelization is provided by the mclapply function from the parallel package for R.
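For example (a sketch; x is again a placeholder, and it is assumed here that parallel accepts a logical flag):

    # Distribute the model fits across cores; "XXXXX" is the wildcard form
    # (explained below) requesting all thirty-two models.
    fit <- mcgfa(x, rG = 1:3, rq = 1:2, models = "XXXXX", parallel = TRUE)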
Model Names & Constraints
The model name indicates which constraints are imposed on the covariance structure of the factor analysis model, as well as on the parameters \(\eta\) and \(\alpha\).
Because the MCGFA is a mixture of factor analyzers model, the covariance matrix of the \(g\)-th group, \(\Sigma_g\), can be decomposed as follows:
$$\Sigma_g = \Lambda_g\Lambda_g' + \Psi_g$$
where \(\Lambda_g\) is a \(p\) by \(q\) factor loading matrix, and \(\Psi_g\) is a diagonal matrix that determines the noise variance for each of the \(p\) variables. A family of eight covariance structures is formed by introducing different sets of constraints on \(\Lambda\) and \(\Psi\); combined with the two constraints on \(\alpha\) and \(\eta\), these yield the thirty-two models listed above.
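As a purely numerical illustration of this decomposition, independent of the package:

    # Construct a covariance matrix with factor-analytic structure
    # Sigma = Lambda %*% t(Lambda) + Psi, for p = 5 variables, q = 2 factors.
    p <- 5; q <- 2
    Lambda <- matrix(rnorm(p * q), nrow = p)  # p x q factor loading matrix
    Psi    <- diag(runif(p, 0.1, 1))          # diagonal noise variances
    Sigma  <- Lambda %*% t(Lambda) + Psi      # resulting p x p covariance
    isSymmetric(Sigma)                        # TRUE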
The five-letter model names are interpreted as follows: ``C'' indicates that the corresponding constraint is imposed, and ``U'' that it is not. In order, the five letters refer to (as the sketch after this list spells out):
1. \(\Lambda\) constrained to be equal across groups (\(\Lambda_g = \Lambda\))
2. \(\Psi\) constrained to be equal across groups (\(\Psi_g = \Psi\))
3. Error variances constrained to be equal within groups (\(\Psi_g = I\psi\))
4. \(\alpha\) constrained to be equal across groups (\(\alpha_g = \alpha\))
5. \(\eta\) constrained to be equal across groups (\(\eta_g = \eta\))
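The following hypothetical helper, not part of the package, spells out the constraints encoded by a given model name:

    describe_model <- function(name) {
      imposed <- strsplit(name, "")[[1]] == "C"   # TRUE where constrained
      constraints <- c("Lambda_g = Lambda", "Psi_g = Psi", "Psi_g = I*psi",
                       "alpha_g = alpha", "eta_g = eta")
      paste(constraints, ifelse(imposed, "(imposed)", "(not imposed)"))
    }
    describe_model("CUUCU")
    # e.g. "Lambda_g = Lambda (imposed)" "Psi_g = Psi (not imposed)" ...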
The subset of models to fit is specified in the models argument as a character vector. For convenience, the user may also use the character 'X' in a model name to indicate a wildcard. For example, 'UUUXX' is equivalent to the set c('UUUCC','UUUCU','UUUUC','UUUUU'). These wildcards may be combined, so for example c('UUUUU','CCCCX') is equivalent to c('UUUUU','CCCCU','CCCCC'). Any duplicate models generated by wildcards are removed.
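A minimal sketch of this expansion logic (illustrative only; the package performs it internally):

    expand_models <- function(models) {
      expand_one <- function(name) {
        i <- regexpr("X", name, fixed = TRUE)
        if (i < 0) return(name)                 # no wildcard left
        unlist(lapply(c("C", "U"), function(letter) {
          substr(name, i, i) <- letter          # fill in one wildcard
          expand_one(name)                      # recurse on any remaining X
        }))
      }
      unique(unlist(lapply(models, expand_one)))  # drop duplicates
    }
    expand_models(c("UUUUU", "CCCCX"))
    # [1] "UUUUU" "CCCCC" "CCCCU"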
Initialization Methods
Because the AECM algorithm cannot guarantee convergence to a global maximum, the success of the algorithm in reaching a good fit depends heavily on starting values. Starting values here refer to the prior probabilities of class membership for each observation, represented as an \(n\)-by-\(G\) stochastic matrix \(z\), where \(n\) is the number of observations and \(G\) is the number of components. The element \(z[i,j]\) indicates the prior probability that observation \(i\) lies in group \(j\). Several options are provided for generating this initialization matrix, through the init_method argument.
The default initialization method is emEM. In this case, several candidate initial classification matrices are generated, and the AECM algorithm is applied to each for a small number of iterations. Whichever candidate achieves the best BIC value is selected as the initialization for the full AECM runs. This process occurs separately for each value of \(G\), but uses only one parsimonious model and one value of \(q\). The number of candidates, number of iterations, parsimonious model and value of \(q\) used for the emEM initialization can be specified in the ememargs argument.
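A sketch of such a call; the element names of the ememargs list below are placeholders for the four options just described, not confirmed names:

    # Hypothetical ememargs: 25 candidate starts, 5 AECM iterations each,
    # using model UUUUU and q = 2 during the initialization stage only.
    fit <- mcgfa(x, rG = 1:3, rq = 1:3,
                 ememargs = list(numstart = 25, iterations = 5,
                                 model = "UUUUU", q = 2))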
The second option is kmeans, which uses the \(k\)-means clustering method, a fast and simple algorithm, to generate a ``hard'' initial classification.
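For example (x again a placeholder):

    # Use k-means rather than the emEM default to initialize:
    fit <- mcgfa(x, rG = 1:3, init_method = "kmeans")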
Finally, using the given option, the user may provide a specific initialization matrix for each value of G in rG. The initialization is provided in the init_z argument, which must be a list of the same length as rG, where each element is a matrix with \(n\) rows and a number of columns equal to the corresponding value of rG. The \([i,j]\)-th element of each matrix indicates the initial probability that observation \(i\) lies in class \(j\).
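For example, hard initial classifications for rG = c(2, 3) could be encoded as follows (a sketch; x, labels2 and labels3 are placeholders):

    # Build one n x G indicator matrix per value of rG from provisional labels.
    to_z <- function(labels, G) {
      z <- matrix(0, nrow = length(labels), ncol = G)
      z[cbind(seq_along(labels), labels)] <- 1  # row i: 1 in column labels[i]
      z
    }
    init_z <- list(to_z(labels2, 2), to_z(labels3, 3))
    fit <- mcgfa(x, rG = c(2, 3), init_method = "given", init_z = init_z)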
Classification
This function provides two methods for model-based classification. The first is the usual fully supervised classification, in which the model is fit to fully labelled data (by providing a known vector argument with no zeros). After model fitting takes place, the predict generic function can be applied to the mcgfa object returned by this function, to predict the class labels of a matrix of unlabelled observations. This method has the advantage that as new observations arise, predictions can be made quickly, without refitting the model.
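A sketch of this workflow (x_train, train_labels and x_new are placeholders):

    # Fully supervised: every training label is known (no zeros in `known`).
    fit  <- mcgfa(x_train, rG = 3, known = train_labels)  # labels in 1, 2, 3
    pred <- predict(fit, x_new)                           # classify new data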
The other form of classification is semi-supervised learning. Semi-supervised classification makes use of both unlabelled and labelled data for training: the information the unlabelled data provide about the structure of the dataset can improve classification. For this method, elements of the known vector argument are simply set to zero, indicating that the corresponding observations have no known label. The MCGFA model is then fit and the unknown elements are classified as usual.
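For example (placeholders as before; zeros mark the unlabelled rows):

    # Semi-supervised: known labels for some observations, 0 for the rest.
    known <- c(train_labels, rep(0, nrow(x_unlabelled)))
    fit   <- mcgfa(rbind(x_train, x_unlabelled), rG = 3, known = known)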
The second method requires that the model be fit again each time new data arrive. However, it is also possible to apply the predict method to a model fit using semi-supervised classification.