This function decorrelates the training dataset by adjusting data for the effects of latent factors of dependence.
decorrelate.train(data.train, nbf = NULL, maxnbfactors = 12, diagnostic.plot = FALSE,
                  min.err = 0.001, verbose = TRUE, EM = TRUE, maxiter = 15, ...)
A list containing the training dataset with the following components: x is the n x p matrix of explanatory variables, where n stands for the training sample size and p for the number of explanatory variables; y is a numeric vector giving the group of each individual, numbered from 1 to K.
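As a minimal sketch of the structure described above, the following builds such a list from simulated values using base R only. The object name my.train and the sizes n, p and K are illustrative assumptions, not part of the package.

```r
set.seed(1)
n <- 20   # training sample size (illustrative)
p <- 5    # number of explanatory variables (illustrative)
K <- 2    # number of groups (illustrative)

x <- matrix(rnorm(n * p), nrow = n, ncol = p)  # n x p matrix of explanatory variables
y <- rep(seq_len(K), length.out = n)           # numeric group labels in 1..K

# Hypothetical training list with the two documented components
my.train <- list(x = x, y = y)
str(my.train)
```

The number of rows of x must match the length of y, since each row describes one individual.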
Number of factors. If nbf = NULL, the number of factors is estimated. nbf can also be set to a positive integer value. If nbf = 0, the data are not factor-adjusted.
The maximum number of factors. Default is maxnbfactors = 12.
If diagnostic.plot = TRUE, the values of the variance inflation criterion are plotted for each number of factors. Default is diagnostic.plot = FALSE. This option can help when choosing the number of factors manually.
Threshold of convergence of the algorithm criterion. Default is min.err=0.001.
If verbose = TRUE, the number of factors and the values of the objective criterion are printed at each iteration. Default is verbose = TRUE.
The method used to estimate the parameters of the factor model. If EM = TRUE, parameters are estimated by an EM algorithm. Setting EM = TRUE is recommended when the number of covariates exceeds the number of observations. If EM = FALSE, the parameters are estimated by maximum likelihood using factanal. Default is EM = TRUE.
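To give a feel for what factor adjustment does, here is a rough base R sketch that fits a factor model with factanal and subtracts the estimated factor contribution from standardized data. It is an illustration under simulated data and is not the procedure implemented by decorrelate.train, which is iterative and accounts for the group variable; all object names are assumptions made for the example.

```r
set.seed(42)
n <- 300; p <- 12; k <- 2

# Simulated loadings: factor 1 drives variables 1-6, factor 2 drives variables 7-12
B <- matrix(0, nrow = p, ncol = k)
B[1:6, 1]  <- runif(6, 0.7, 1.2)
B[7:12, 2] <- runif(6, 0.7, 1.2)

# Correlated data = latent factors times loadings, plus independent noise
Z <- matrix(rnorm(n * k), nrow = n, ncol = k)
x <- Z %*% t(B) + matrix(rnorm(n * p), nrow = n, ncol = p)

# Maximum-likelihood factor model on standardized data
xs  <- scale(x)
fit <- factanal(xs, factors = k, scores = "regression")
L   <- unclass(fit$loadings)            # p x k matrix of estimated loadings

# Remove the estimated common part (scores times loadings)
adjusted <- xs - fit$scores %*% t(L)

# Average absolute pairwise correlation before and after adjustment
mean.abs.cor <- function(m) {
  r <- cor(m)
  mean(abs(r[upper.tri(r)]))
}
before <- mean.abs.cor(xs)
after  <- mean.abs.cor(adjusted)
c(before = before, after = after)
```

factanal needs more observations than variables, which is one reason the EM route is the recommended one in high-dimensional settings.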
Maximum number of iterations for estimation of the factor model. Default is maxiter = 15.
Other arguments that can be passed to the cv.glmnet and glmnet functions from the glmnet package. These functions are used to estimate individual group probabilities. Modifying these parameters should not affect the decorrelation procedure. However, the argument nfolds in cv.glmnet is set to 10 by default and should be reduced (minimum 3) for large datasets, in order to decrease the computation time of decorrelate.train.
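A hedged sketch of passing nfolds through the extra arguments is shown below. It assumes that the function and the data.train example dataset are provided by the FADA package, and it only runs when that package is installed; the values nbf = 3 and nfolds = 3 are illustrative choices.

```r
# Assumption: decorrelate.train and data.train come from the FADA package
if (requireNamespace("FADA", quietly = TRUE)) {
  library(FADA)
  data(data.train)

  # nfolds is forwarded to cv.glmnet; 3 is the minimum mentioned above
  res <- decorrelate.train(data.train, nbf = 3, nfolds = 3, verbose = FALSE)

  # List the elements of the returned object
  print(names(res))
}
```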
Returns a list with the following elements:
Group means estimated after iterative decorrelation
Decorrelated training data
Estimation of the factor model parameters: specific variance
Estimation of the factor model parameters: loadings
Scores of the training individuals on the factors
Copy of the group variable of the training data
Internal value (estimation of individual probabilities for the training dataset)
Friedman, J., Hastie, T. and Tibshirani, R. (2010), Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, 33, 1-22.
Friguet, C., Kloareg, M. and Causeur, D. (2009), A factor model approach to multiple testing under dependence. Journal of the American Statistical Association, 104:488, 1406-1415.
Perthame, E., Friguet, C. and Causeur, D. (2015), Stability of feature selection in classification issues for high-dimensional correlated data, Statistics and Computing.
# NOT RUN {
library(FADA)  # assumed source package of decorrelate.train and data.train
data(data.train)
res0 <- decorrelate.train(data.train, nbf = 3)  # number of factors forced to 3
res1 <- decorrelate.train(data.train)           # number of factors estimated from the data
# }