panpca: Principal component analysis of a pan-matrix

Description

Computes a principal component decomposition of a pan-matrix, with optional scaling and weighting.

Usage

panpca(pan.matrix, scale = 0, weights = rep(1, dim(pan.matrix)[2]))

Arguments

pan.matrix

A Panmat object, see panMatrix for details.

scale

An optional scale to control how copy numbers should affect the distances.

weights

Vector of optional weights of gene clusters.

Value

A Panpca object is returned from this function. This is a small (S3) extension of a list with elements Evar, Scores, Loadings, Scale and Weights.

Evar is a vector with one number for each principal component. It contains the relative explained variance for each component, and it always sums to 1.0. This value indicates the importance of each component, and it is always in descending order, the first component being the most important. The Evar is typically the first result you look at after a PCA has been computed, as it indicates how many components (directions) you need to capture the bulk of the total variation in the data.

Scores is a matrix with one column for each principal component and one row for each genome. The columns are ordered corresponding to the elements in Evar. The scores are the coordinates of each genome in the principal component space. See plotScores for how to visualize genomes in the score-space.

Loadings is a matrix with one column for each principal component and one row for each gene cluster. The columns are ordered corresponding to the elements in Evar. The loadings are the contribution from each original gene cluster to the principal component directions. NOTE: Only gene clusters having a non-zero variance are used in a PCA. Gene clusters with the same value for every genome have no impact and are discarded from the Loadings. See plotLoadings for how to visualize gene clusters in the loading space.

Scale and Weights are copies of the corresponding input arguments.

The generic functions plot.Panpca and summary.Panpca are available for Panpca objects.

Details

A principal component analysis (PCA) can be computed for any matrix, including a pan-matrix. The principal components will in this case be linear combinations of the gene clusters. One major idea behind PCA is to truncate the space: instead of considering the genomes as points in a high-dimensional space spanned by all gene clusters, we look for a few ‘smart’ combinations of the gene clusters, and visualize the genomes in a low-dimensional space spanned by these directions.
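To make the idea concrete, here is a minimal sketch of a PCA on a toy presence/absence matrix using base R's prcomp. The matrix values and the variable names (X, keep, evar, scores, loadings) are made up for illustration, and this is not claimed to be the internal implementation of panpca; it only shows the kind of quantities (explained variance, scores, loadings) that a PCA of a genomes-by-clusters matrix produces.

```r
# Toy pan-matrix: 4 genomes (rows) x 5 gene clusters (columns); values are hypothetical
X <- matrix(c(1, 1, 0, 1, 0,
              1, 0, 0, 1, 1,
              1, 1, 1, 0, 0,
              1, 0, 1, 0, 1),
            nrow = 4, byrow = TRUE,
            dimnames = list(paste0("genome", 1:4), paste0("cluster", 1:5)))

# Drop clusters with zero variance (cluster1 is present in every genome)
keep <- apply(X, 2, sd) > 0
pca <- prcomp(X[, keep, drop = FALSE])

# Relative explained variance per component: sums to 1, in descending order
evar <- pca$sdev^2 / sum(pca$sdev^2)

scores <- pca$x           # one row per genome, one column per component
loadings <- pca$rotation  # one row per retained gene cluster
```

Here cluster1 carries no information because it is present in all genomes, so it does not appear among the rows of loadings.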

The scale can be used to control how copy number differences play a role in the PCA. Usually we assume that going from 0 to 1 copy of a gene is the big change for a genome, and going from 1 to 2 (or more) copies matters less. Prior to computing the PCA, the pan.matrix is transformed according to the following affine mapping: If the original value in pan.matrix is x, and x is not 0, then the transformed value is 1 + (x-1)*scale. Values of 0 are left as 0. Note that with scale=0.0 (default) this will result in 1 regardless of how large x was. In this case the PCA only distinguishes between presence and absence of gene clusters. If scale=1.0 the value x is left untransformed. In this case the difference between 1 copy and 2 copies is just as big as between 1 copy and 0 copies. For any scale between 0.0 and 1.0 the transformed value is shrunk towards 1, but a certain effect of larger copy numbers is still present. In this way you can decide if the PCA should be affected, and to what degree, by differences in copy numbers beyond 1.
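The mapping described above can be sketched in a few lines of base R. The helper name scale_copy_numbers is hypothetical and not part of micropan, and restricting scale to the interval from 0 to 1 is an assumption made here for the sketch.

```r
# Hypothetical helper mirroring the mapping described above; not part of micropan
scale_copy_numbers <- function(x, scale = 0) {
  # Assumption: scale is restricted to [0, 1]
  stopifnot(scale >= 0, scale <= 1)
  present <- x > 0                            # zeros are left untouched
  x[present] <- 1 + (x[present] - 1) * scale  # shrink copy numbers towards 1
  x
}

copies <- c(0, 1, 2, 5)
scale_copy_numbers(copies, scale = 0)    # 0 1 1 1    (presence/absence only)
scale_copy_numbers(copies, scale = 0.5)  # 0 1 1.5 3  (shrunk towards 1)
scale_copy_numbers(copies, scale = 1)    # 0 1 2 5    (untransformed)
```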

The PCA can also up- or downweight some clusters compared to others. The vector weights must contain one value for each column in pan.matrix. The default is to use flat weights, i.e. all clusters count equally. See geneWeights for alternative weighting strategies.
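One common way to apply per-cluster weights is to multiply each column of the matrix by its weight before the PCA. The sketch below shows this on a toy matrix; it is an assumption about how such weighting could work, not a statement of what panpca does internally, and the names X, w and Xw are illustrative only.

```r
# Toy matrix: 2 genomes (rows) x 3 gene clusters (columns); values are hypothetical
X <- matrix(c(1, 0, 1,
              1, 1, 0), nrow = 2, byrow = TRUE)

# One weight per column (gene cluster)
w <- c(1, 0.5, 2)
stopifnot(length(w) == ncol(X))

# Assumed weighting scheme: scale each column by its weight
Xw <- sweep(X, 2, w, `*`)
```

With flat weights (all equal to 1) the matrix is unchanged, which matches the default behaviour described above.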

The functions plotScores and plotLoadings can be used to visualize the results of panpca.

Examples

# Attach the package so that panpca, plotScores etc. are available
library(micropan)

# Loading two Panmat objects in the micropan package
data(list=c("Mpneumoniae.blast.panmat","Mpneumoniae.domain.panmat"),package="micropan")

# Panpca based on a BLAST clustering Panmat object
ppca.blast <- panpca(Mpneumoniae.blast.panmat)
plot(ppca.blast) # The generic plot function
plotScores(ppca.blast) # A score-plot

# Panpca based on domain sequence clustering Panmat object
w <- geneWeights(Mpneumoniae.domain.panmat,type="shell")
ppca.domains <- panpca(Mpneumoniae.domain.panmat,scale=0.5,weights=w)
summary(ppca.domains)
plotLoadings(ppca.domains)
