
FisherEM (version 1.6)

bfem: The Bayesian Fisher-EM algorithm.

Description

The Bayesian Fisher-EM algorithm is built on a Bayesian formulation of the model used in fem. It is a subspace clustering method for high-dimensional data. It is based on a Gaussian mixture model and on the idea that the data live in a common, low-dimensional subspace. A VEM-like algorithm estimates both the discriminative subspace and the parameters of the mixture model.
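
A minimal usage sketch (the four quantitative columns of iris are used purely for illustration, and the package is assumed to be installed):

library(FisherEM)
# Cluster the quantitative variables of iris, letting ICL pick K in 2:4.
res <- bfem(as.matrix(iris[, 1:4]), K = 2:4, model = "AkjBk", crit = "icl")
res$K           # selected number of groups
table(res$cls)  # estimated cluster memberships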

Usage

bfem(
  Y,
  K = 2:6,
  model = "AkjBk",
  method = "gs",
  crit = "icl",
  maxit.em = 100,
  eps.em = 1e-06,
  maxit.ve = 3,
  eps.ve = 1e-04,
  lambda = 1000,
  emp.bayes = T,
  init = "kmeans",
  nstart = 10,
  Tinit = c(),
  kernel = "",
  disp = FALSE,
  mc.cores = (detectCores() - 1),
  subset = NULL
)

Arguments

Y

The data matrix. Categorical variables and missing values are not allowed.

K

An integer vector specifying the numbers of mixture components (clusters) among which the model selection criterion will choose the most appropriate number of groups. Default is 2:6.

model

A vector of Bayesian discriminative latent mixture (BDLM) models to fit. There are 12 different models: "DkBk", "DkB", "DBk", "DB", "AkjBk", "AkjB", "AkBk", "AkB", "AjBk", "AjB", "ABk", "AB". The option "all" runs the algorithm on all 12 models and selects the best one according to the model selection criterion. Similar to fem.
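
For instance, a handful of models can be compared in a single call (a sketch, with Y a numeric data matrix):

# Fit three BDLM models; the one maximizing the chosen criterion is kept
# (use model = "all" to try all 12 models).
res <- bfem(Y, K = 2:4, model = c("AkjBk", "AkB", "AB"), crit = "icl")
res$crit  # criterion used for the selection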

method

The method used for fitting the projection matrix associated with the discriminative subspace. Three methods are available: 'gs' (Gram-Schmidt, the original proposition), 'svd' (based on the SVD, faster) and 'reg' (the Fisher criterion is rewritten as a regression problem). The 'gs' method is the default.

crit

The model selection criterion to use for selecting the most appropriate model for the data. There are 3 possibilities: "bic", "aic" or "icl". Default is "icl".

maxit.em

The maximum number of iterations of the main EM loop of the BFEM algorithm.

eps.em

The threshold value for the likelihood differences (Aitken's criterion) to stop the BFEM algorithm.

maxit.ve

The maximum number of iterations of the VE-step loop (a fixed-point algorithm).

eps.ve

The threshold value used to stop the VE-step fixed-point loop.

lambda

The initial value for the variance of the Gaussian prior on the means in the latent space.

emp.bayes

Should the hyper-parameters (mean and variance) of the prior be updated? Defaults to TRUE.

init

The initialization method for the algorithm. There are 4 options: "random" for a random initialization, "kmeans" for an initialization by the k-means algorithm, "hclust" for a hierarchical clustering initialization, or "user" for a user-specified initialization through the parameter "Tinit". Default is "kmeans". Note that for "kmeans" and "random", several initializations are tried and the one associated with the highest likelihood is kept (see "nstart").

nstart

The number of restarts when the initialization is "kmeans" or "random". In such a case, the initialization associated with the highest likelihood is kept.

Tinit

An n x K matrix containing posterior probabilities used to initialize the algorithm (each row corresponds to an individual).
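
A sketch of a user-supplied initialization, building the n x K matrix of soft assignments from a preliminary k-means partition (Y is assumed to be a numeric data matrix):

# One-hot encode a k-means partition as initial posterior probabilities.
K <- 3
cl <- kmeans(Y, centers = K, nstart = 5)$cluster
Tinit <- matrix(0, nrow = nrow(Y), ncol = K)
Tinit[cbind(seq_len(nrow(Y)), cl)] <- 1
res <- bfem(Y, K = K, init = "user", Tinit = Tinit)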

kernel

Allows dealing with the n < p case (more variables than observations). By default, no kernel ("") is used. Three kernels are available: "linear", "sigmoid" or "rbf".
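
A sketch of a call in the n < p setting (the simulated matrix below is an illustration only):

set.seed(1)
Yhd <- matrix(rnorm(50 * 200), nrow = 50)  # 50 observations, 200 variables
res <- bfem(Yhd, K = 2:3, kernel = "linear")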

disp

If TRUE, some messages are printed during the clustering. Default is FALSE.

mc.cores

The number of CPUs to use for fitting the different models in parallel (non-Windows platforms only). Default is the number of available cores minus 1.

subset

A positive integer defining the size of the subsample; default is NULL. For large data sets, it may be useful to fit a FisherEM model on a subsample of the data and then use this model to predict cluster assignments for the whole data set. Note that, in such a case, likelihood values and model selection criteria are computed on the subsample, not on the whole data set.
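
A sketch of fitting on a subsample of a large data set (Ybig below is a placeholder for the user's data):

set.seed(1)
Ybig <- matrix(rnorm(10000 * 10), ncol = 10)  # placeholder large data set
res <- bfem(Ybig, K = 2:4, subset = 1000)     # fit on 1000 sampled observations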

Value

A list is returned:

  • K - The number of groups.

  • cls - The group membership of each individual estimated by the BFEM algorithm

  • Tinit - The initial posterior probabilities used to start the algorithm

  • d - The dimension of the discriminative subspace

  • elbos - A vector containing the evolution of the variational lower bound at each iteration

  • loglik - The final value of the variational lower bound

  • n_ite - The number of iterations until convergence of the BFEM algorithm

  • P - The posterior probabilities of each individual for each group

  • U - The loading matrix which determines the orientation of the discriminative subspace

  • param - A list containing the estimated parameters of the model

    • PI - The mixture proportions

    • Sigmak - An array containing estimated cluster covariances in the latent space

    • Beta - The noise variance in each cluster

  • var_param - A list containing the variational distribution parameters

    • logtau - An n x K matrix containing the logarithm of the multinomial parameters of q(Z)

    • Varmeank - A K x d matrix containing the variational mean

    • Varcovk - A d x d x K array containing the variational covariance matrices.

  • proj - The projected data on the discriminative subspace.

  • aic - The value of the Akaike information criterion

  • bic - The value of the Bayesian information criterion

  • icl - The value of the integrated completed likelihood criterion

  • method - The method used in the F-step

  • call - The call of the function

  • crit - The model selection criterion used
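
A short sketch of accessing the main components of the returned list, with res a fitted object such as res.bfem from the Examples below (res$proj is assumed to be an n x d matrix):

res$K           # selected number of groups
table(res$cls)  # cluster sizes
res$icl         # value of the ICL criterion
head(res$P)     # posterior probabilities (n x K)
# First two axes of the projected data, coloured by the estimated
# partition (assuming the selected subspace has dimension d >= 2).
plot(res$proj[, 1:2], col = res$cls, pch = 19)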

See Also

fem

Examples

## Not run:
# Chang's 1983 setting
simu = simu_bfem(300, which = "Chang1983")
Y = simu$Y
res.bfem = bfem(Y, K = 2:6, model = c('AB'), init = 'kmeans', nstart = 1,
                maxit.em = 10, eps.em = 1e-3, maxit.ve = 3, mc.cores = 2)

## End(Not run)
