Performs replications of sPLSDA on random subsamplings of the data
bootsPLS(X,Y,near.zero.var,many=50,ncomp=2,
dist = c("max.dist", "centroids.dist", "mahalanobis.dist"),
save.file,ratio,kCV=10,grid,cpus,nrepeat=1,showProgress=TRUE)
Input matrix of dimension n * p; each row is an observation vector.
Factor with at least q>2 levels.
Logical. If TRUE, a pre-screening step is performed to remove predictors with near-zero variance. See nearZeroVar
.
How many replications of the sPLS-DA analysis are to be done?
How many component are to be included in the sPLS-DA analysis?
Indicates the distance that is used to classify the samples. One of "max.dist", "centroids.dist", "mahalanobis.dist". Default is "max.dist"
If the outputs are to be saved, this argument allows you to do it at the end of each replication. A full path is expected. Convenient if you run this function on a cluster and it is killed before completion, e.g. due to a too short requested time.
Number between 0 and 1. It is the proportion of the n samples that are put aside and considered as an internal testing set. The (1-ratio)*n samples are used as a training set and the kCV
fold cross validation is performed on them. Default is 0.3
Number of fold for the cross validation. Default is 10.
A vector of value for the tuning of the keepX
parameter of sPLS-DA on each component. See spls
for more details on keepX
. Default is grid=1:min(40,ncol(X))
.
Number of cpus to use when running the code in parallel.
Number of times the Cross-Validation process is repeated for each of the many
replications. See tune.splsda
for details.
Logical. If TRUE, shows the progress of the algorithm. It also gives a list of which variables are selected on each component.
A 'bootsPLS' object is returned for which plot
, fit.model
and prediction
are available.
A 4-dimensional array. The two first dimensions consists in the confusion matrix. The third dimension is relative to the number of components ncomp
. The fourth dimension concerns the number of replication many
.
A 3-dimensional array. Loadings vector of X, for each component and each replication.
A 3-dimensional array. Gives the selected variables for each component and each replication. It is obtained by replacing each non zero value in loadings.X
by 1.
A matrix of size ncomp*p. Gives the frequency of selection for each variable on each component. It is obtained as a mean over the third dimension of selection.variable
Matrix of size many*ncomp. Gives the number of variables that have been selected on each component for each replication.
Matrix of size n*many. Gives the samples that have been used in the internal training set over the many
replications. These samples have the value 1, the others 0.
A 3-dimensional array of size n*many*ncomp. Gives the prediction for the chosen dist
of all the samples, either in the learning set or the testing set.
A list of the input data X, Y and of the distance used to classify the sample ("max.dist", "centroids.dist" or "mahalanobis.dist").
Performs replication of tune.splsda
on random subsamplings of the data and record which variables are selected on which subsamplings. It also gives a confusion matrix for each component and for each subsamplings.
Rohart et al. (2016). A Molecular Classification of Human Mesenchymal Stromal Cells. PeerJ, DOI 10.7717/peerj.1845
# NOT RUN {
data(MSC)
X=MSC$X
Y=MSC$Y
dim(X)
table(Y)
boot=bootsPLS(X=X,Y=Y,ncomp=3,many=5,kCV=5)
# saving the outputs in a Rdata file, the file is saved after each iteration
# if used on a cluster, you can use the `cpus' argument as well
save.file=paste(getwd(),"/MSC.",Sys.getpid(),".Rdata",sep="")
boot=bootsPLS(X=X,Y=Y,ncomp=3,many=5,kCV=5,save.file=save.file)
# }
Run the code above in your browser using DataLab