wrappers: Generalised Canonical Correlation Analysis

Description

Wrapper functions to perform (sparse) (regularized) Generalised Canonical Correlation Analysis (discriminant analysis), a generalised approach for the integration of multiple datasets.

Usage

wrapper.rgcca(blocks,
              design = NULL,
              ncomp = rep(2, length(blocks)),
              tau = "optimal",
              scheme = "centroid",
              scale = TRUE,
              bias = FALSE,
              max.iter = 1000,
              tol = .Machine$double.eps,
              verbose = FALSE,
              near.zero.var = FALSE)

wrapper.sgcca(blocks,
              design = NULL,
              penalty = NULL,
              ncomp = rep(2, length(blocks)),
              keep = NULL,
              scheme = "centroid",
              scale = TRUE,
              bias = FALSE,
              max.iter = 1000,
              tol = .Machine$double.eps,
              verbose = FALSE,
              near.zero.var = FALSE)

wrapper.sgccda(blocks,
               Y,
               design = NULL,
               ncomp = rep(2, length(blocks)),
               keep = NULL,
               scheme = "centroid",
               scale = TRUE,
               bias = FALSE,
               max.iter = 1000,
               tol = .Machine$double.eps,
               verbose = FALSE,
               near.zero.var = FALSE)

Arguments

blocks

a list of data sets (called 'blocks') matching on the same samples. Data in the list should be arranged in samples x variables, with samples order matching in all data sets. NAs are not allowed.
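The requirements above can be checked before calling a wrapper. The following is a minimal base R sketch on toy data; check_blocks is a hypothetical helper, not part of the package, and it assumes that every block carries row names identifying the samples.

```r
# Hypothetical helper (not part of the package): validate a list of blocks.
check_blocks <- function(blocks) {
  stopifnot(is.list(blocks), length(blocks) > 0)
  ref <- rownames(blocks[[1]])
  if (is.null(ref)) stop("blocks need row names (sample identifiers)")
  for (i in seq_along(blocks)) {
    b <- blocks[[i]]
    # every block must hold the same samples, in the same order
    if (!identical(rownames(b), ref)) {
      stop("sample names or order differ in block ", i)
    }
    # missing values are not allowed
    if (any(is.na(b))) stop("block ", i, " contains NAs")
  }
  invisible(TRUE)
}

# Toy data: two blocks measured on the same five samples
set.seed(1)
samples <- paste0("s", 1:5)
X1 <- matrix(rnorm(15), nrow = 5, dimnames = list(samples, paste0("g", 1:3)))
X2 <- matrix(rnorm(10), nrow = 5, dimnames = list(samples, paste0("l", 1:2)))
check_blocks(list(gene = X1, lipid = X2))
```

The helper stops with an error at the first block that breaks a requirement and returns TRUE invisibly otherwise.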

Y

for wrapper.sgccda only: a factor or a class vector for the discrete outcome.

design

numeric matrix of size (number of blocks) x (number of blocks) with only 0 or 1 values. A value of 1 (0) indicates a relationship (no relationship) between the blocks to be modelled. For wrapper.sgccda the Y outcome should not be added in the

ncomp

numeric vector of length the number of blocks in blocks. The number of components to include in the model for each block (does not necessarily need to take the same value for each block). By default set to 2 per block.

tau

for wrapper.rgcca only: numeric vector of length the number of blocks in data. Each regularization parameter will be applied on each block and takes the value between 0 (no regularisation) and 1. If tau = "optimal" the shrinkage

penalty

for wrapper.sgcca and wrapper.sgccda only: numeric vector of length the number of blocks in data. Each penalty parameter will be applied on each block and takes the value between 0 (no variable selected) and 1 (all v

keep

for wrapper.sgcca and wrapper.sgccda only: a list of integer values for each block specifying the number of variables to select on each specified component. Each element of the list corresponds to a block and is of length the number

scheme

Either "horst", "factorial" or "centroid" (Default: "centroid"), see reference paper.
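As a rough guide to what the three options mean: in the RGCCA framework of Tenenhaus and Tenenhaus (2011), the scheme selects the function g applied to the covariance between connected block components, with g(x) = x for "horst", g(x) = x^2 for "factorial" and g(x) = |x| for "centroid". The base R sketch below only illustrates these three functions; scheme_g is a hypothetical name and this is not the package's internal code.

```r
# Hypothetical illustration: return the scheme function g for a given name.
scheme_g <- function(scheme = c("centroid", "horst", "factorial")) {
  scheme <- match.arg(scheme)
  switch(scheme,
         horst     = function(x) x,       # identity
         factorial = function(x) x^2,     # square
         centroid  = function(x) abs(x))  # absolute value
}

g <- scheme_g("centroid")
g(-0.4)   # 0.4: the sign of the covariance is ignored
```

With "horst" the sign of the covariance is kept, so negatively correlated components are penalised, whereas "factorial" and "centroid" treat positive and negative covariances alike.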

scale

boolean. If scale = TRUE, each block is standardized to zero means and unit variances (default: TRUE).
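The effect of standardizing a single block can be illustrated with base R's scale() function. This is a sketch on toy data that assumes the usual column-wise centring and scaling; it is not the wrapper's internal code.

```r
# Toy block: 10 samples x 2 variables with non-zero means and non-unit variances
set.seed(42)
X <- matrix(rnorm(20, mean = 5, sd = 3), nrow = 10,
            dimnames = list(NULL, c("v1", "v2")))

# Centre each column to mean zero and scale it to unit variance
X.std <- scale(X, center = TRUE, scale = TRUE)

round(colMeans(X.std), 10)   # column means are zero (up to rounding error)
apply(X.std, 2, sd)          # column standard deviations are one
```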

bias

boolean. A logical value for the biased or unbiased estimator of the var/cov (defaults to FALSE).
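The two estimators differ only in the divisor: the biased estimator divides the sum of squared deviations by n, the unbiased one by n - 1. A base R sketch on toy numbers follows; it assumes that bias = TRUE selects the divisor n and is not taken from the package's code.

```r
x <- c(2, 4, 4, 4, 5, 5, 7, 9)
n <- length(x)
ss <- sum((x - mean(x))^2)     # sum of squared deviations from the mean

var.biased   <- ss / n         # divisor n
var.unbiased <- ss / (n - 1)   # divisor n - 1, as computed by var(x)

var.biased                     # 4
all.equal(var.unbiased, var(x))
```

The difference shrinks as the number of samples grows, so the choice matters mostly for small data sets.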

max.iter

integer, the maximum number of iterations.

tol

Convergence stopping value.

verbose

if set to TRUE, reports progress on computing.

near.zero.var

boolean, see the internal nearZeroVar function (should be set to TRUE in particular for data with many zero values). Setting this argument to FALSE (when appropriate) will speed up the computations. Def

Value

wrapper.rgcca, wrapper.sgcca and wrapper.sgccda return an object of class "rgcca", "sgcca" or "sgccda", a list that contains the following components:

blocks

the input data set (as a list).

design

the input design.

variates

the sgcca components.

loadings

the loadings for each block data set (outer weight vector).

loadings.star

the standardised loading vectors.

tau

the input tau parameter.

scheme

the input scheme.

ncomp

the number of components on each block.

crit

the convergence criterion.

AVE

indicators of model quality based on the Average Variance Explained (AVE): AVE (for one block), AVE (outer model), AVE (inner model).

names

list containing the names to be used for individuals and variables.

defl.matrix

the deflated matrices at the end of the algorithm.

More details can be found in the references.

Details

These wrapper functions are improved versions of the functions in the RGCCA package. rGCCA is an unsupervised model, sGCCA is a sparse model and sGCC-DA is a supervised sparse model. In sGCC-DA the arguments design, penalty and keep are specified only for the data sets given in the input blocks.

References

Tenenhaus A. and Tenenhaus M. (2011), Regularized Generalized Canonical Correlation Analysis, Psychometrika, Vol. 76, Nr 2, pp 257-284.

Schafer J. and Strimmer K. (2005), A shrinkage approach to large-scale covariance matrix estimation and implications for functional genomics, Statist. Appl. Genet. Mol. Biol. 4:32.

Tenenhaus A., Philippe C., Guillemot V., Le Cao K-A., Grill J., Frouin V. (2014), Variable Selection For Generalized Canonical Correlation Analysis, Biostatistics, 15(3): 569-83.

Gunther O. P., Shin H., Ng R. T., McMaster W. R., McManus B. M., Keown P. A., Tebbutt S. J., Le Cao K-A. (2014), Novel multivariate methods for integration of genomics and proteomics data: Applications in a kidney transplant rejection study, OMICS: A journal of integrative biology, 18(11), 682-95.

Examples

## RGCCA 
# --------------
data(nutrimouse)
# need to unmap Y for an unsupervised analysis, where Y is included as a data block in data
Y = unmap(nutrimouse$diet)
data = list(gene = nutrimouse$gene, lipid = nutrimouse$lipid, Y = Y)
# with this design, all blocks are connected
design = matrix(c(0,1,1,1,0,1,1,1,0), ncol = 3, nrow = 3, 
                byrow = TRUE, dimnames = list(names(data), names(data)))

nutrimouse.rgcca <- wrapper.rgcca(blocks = data,
                                         design = design,
                                         tau = "optimal",
                                         ncomp = c(2, 2, 1),
                                         scheme = "centroid",
                                         verbose = FALSE)

# blocks should specify the block data set where the sample plot can be performed 
# (ideally when there are >= 2 components!)
# we indicate the diet variable colors.
plotIndiv(nutrimouse.rgcca, blocks = c(1,2), group = nutrimouse$diet, plot.ellipse = TRUE)

# have a look at the loadings
head(nutrimouse.rgcca$loadings[[1]])
head(nutrimouse.rgcca$loadings[[2]])
head(nutrimouse.rgcca$loadings[[3]])


## sGCCA
# -------------
# same data as above but sparse approach

# version 1 using the penalisation penalty criterion
# ---
nutrimouse.sgcca <- wrapper.sgcca(blocks = data,
                                   design = design,
                                   penalty = c(0.3, 0.5, 1),
                                   ncomp = c(2, 2, 1),
                                   scheme = "centroid",
                                   verbose = FALSE, 
                                   bias = FALSE)

# In plotIndiv we indicate the diet variable colors and the blocks to be plotted 
# (only blocks with comp  >=2!)
plotIndiv(nutrimouse.sgcca, blocks = c(1,2), group = nutrimouse$diet, 
  plot.ellipse = TRUE)

# which variables are selected on a given component?
selectVar(nutrimouse.sgcca, comp = 1, block = 1)
selectVar(nutrimouse.sgcca, comp = 1, block = 2)

# variable plot on the selected variables
plotVar(nutrimouse.sgcca, col = color.mixo(1:2), cex = c(2,2))

# version 2 using the keep penalty criterion (number of variables to keep)
# it is a list per block and per component, need to specify all variables for the 
# Y 'outcome' here 
# (see below for sgccda code, which is more appropriate)
# ----
nutrimouse.sgcca <- wrapper.sgcca(blocks = data,
                                  design = design,
                                  ncomp = c(2, 2, 1),
                                  # for keep: each element of the list corresponds to a block 
                                  # and is of length the # comp per block
                                  keep = list(c(10,10), c(15,15), c(ncol(Y))),
                                  scheme = "centroid",
                                  verbose = FALSE, 
                                  bias = FALSE)


# In plotIndiv we indicate the diet variable colors and the blocks to be plotted 
# (only blocks with comp  >=2!)
plotIndiv(nutrimouse.sgcca, blocks = c(1,2), group = nutrimouse$diet, 
  plot.ellipse = TRUE)

# which variables are selected on a given component?
selectVar(nutrimouse.sgcca, comp = 1, block = 1)
selectVar(nutrimouse.sgcca, comp = 1, block = 2)


## sGCC-DA
# -------------
data(nutrimouse)
Y = nutrimouse$diet
data = list(gene = nutrimouse$gene, lipid = nutrimouse$lipid)
design = matrix(c(0,1,1,1,0,1,1,1,0), ncol = 3, nrow = 3, byrow = TRUE)

nutrimouse.sgccda <- wrapper.sgccda(blocks = data,
                                    Y = Y,
                                    design = design,
                                    keep = list(c(10,10), c(15,15)),
                                    ncomp = c(2, 2, 1),
                                    scheme = "centroid",
                                    verbose = FALSE,
                                    bias = FALSE)

plotIndiv(nutrimouse.sgccda, blocks = c(1,2), group = nutrimouse$diet, 
  plot.ellipse = TRUE)

# which variables are selected on a given component?
selectVar(nutrimouse.sgccda, comp = 1, block = 1)
selectVar(nutrimouse.sgccda, comp = 1, block = 2)

# variable plot on the selected variables
plotVar(nutrimouse.sgccda, col = color.mixo(1:2), cex = c(2,2))
