A principal component analysis (PCA) can be computed for any matrix, including a pan-matrix.
The principal components will in this case be linear combinations of the gene clusters. One major
idea behind PCA is to truncate the space, i.e. instead of considering the genomes as points in a
high-dimensional space spanned by all gene clusters, we look for a few ‘smart’ combinations
of the gene clusters and visualize the genomes in the low-dimensional space spanned by these directions.
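
The idea of replacing many gene-cluster axes with one 'smart' direction can be illustrated with a small sketch. It is written in plain Python, not taken from the package, and it uses power iteration as a stand-in for a full PCA routine. The names center_columns, first_component and toy_pan are hypothetical, and the toy matrix (4 genomes, 3 gene clusters) is made up for illustration.

```python
import math

def center_columns(matrix):
    """Subtract each column mean, so every gene cluster is centered at zero."""
    n = len(matrix)
    means = [sum(col) / n for col in zip(*matrix)]
    return [[x - m for x, m in zip(row, means)] for row in matrix]

def first_component(matrix, iterations=200):
    """Approximate the leading principal direction by power iteration.

    Returns the direction (one loading per gene cluster) and the score of
    each genome along that direction.
    """
    centered = center_columns(matrix)
    p = len(centered[0])
    # Fixed, non-uniform start vector (an arbitrary choice) for reproducibility.
    start = [1.0 / (j + 1) for j in range(p)]
    norm = math.sqrt(sum(x * x for x in start))
    v = [x / norm for x in start]
    for _ in range(iterations):
        scores = [sum(x * w for x, w in zip(row, v)) for row in centered]
        # Multiply by X^T X without forming the covariance matrix.
        nxt = [sum(s * row[j] for s, row in zip(scores, centered)) for j in range(p)]
        norm = math.sqrt(sum(x * x for x in nxt))
        if norm == 0.0:
            break  # no variation left along this direction
        v = [x / norm for x in nxt]
    scores = [sum(x * w for x, w in zip(row, v)) for row in centered]
    return v, scores

# Toy pan-matrix: rows are genomes, columns are gene clusters.
# Cluster 1 is in every genome; clusters 2 and 3 split the genomes in two groups.
toy_pan = [
    [1, 1, 0],
    [1, 1, 0],
    [1, 0, 1],
    [1, 0, 1],
]
direction, scores = first_component(toy_pan)
```

In this toy case the core cluster gets a loading of zero, and the single direction separates the two groups of genomes by the sign of their scores.
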
The scale argument controls how much copy number differences influence the PCA. Usually
we assume that going from 0 to 1 copy of a gene is the big change in the genome, and that going from 1 to
2 (or more) copies is a smaller one. Prior to computing the PCA, the pan.matrix is transformed according
to the following affine mapping: If the original value in pan.matrix is x, and x
is not 0, then the transformed value is 1 + (x-1)*scale. Note that with scale=0.0
(default) this will result in 1 regardless of how large x was. In this case the PCA only
distinguishes between presence and absence of gene clusters. If scale=1.0 the value x is
left untransformed. In this case the difference between 1 copy and 2 copies is just as big as between
1 copy and 0 copies. For any scale between 0.0 and 1.0 the transformed value is shrunk towards
1, but a certain effect of larger copy numbers is still present. In this way you can decide if the PCA
should be affected, and to what degree, by differences in copy numbers beyond 1.
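
The mapping described above can be tried out on a few copy numbers. The following is a minimal Python sketch of the formula only, not the package's own code; the function name scale_copy_numbers and the example row are hypothetical.

```python
def scale_copy_numbers(pan_matrix, scale=0.0):
    """Apply 1 + (x - 1) * scale to every non-zero entry; zeros stay zero."""
    return [
        [1 + (x - 1) * scale if x != 0 else 0 for x in row]
        for row in pan_matrix
    ]

# One genome with 0, 1, 2 and 5 copies in four gene clusters.
counts = [[0, 1, 2, 5]]

presence = scale_copy_numbers(counts, scale=0.0)  # presence/absence only
half = scale_copy_numbers(counts, scale=0.5)      # shrunk towards 1
full = scale_copy_numbers(counts, scale=1.0)      # copy numbers unchanged
```

With scale=0.5 a cluster with 5 copies is transformed to 3, so it still stands out from a single-copy cluster, but less than in the raw counts.
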
The PCA may also up- or downweight some clusters compared to others. The vector weights must
contain one value for each column in pan.matrix. The default is to use flat weights, i.e. all
clusters count equally. See geneWeights for alternative weighting strategies.
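
One plausible way to read the weighting is that each column of the pan-matrix is multiplied by its weight before the PCA is computed. That reading is an assumption here, not something stated above, and the Python sketch below only illustrates it; the function name weight_clusters and the example values are hypothetical.

```python
def weight_clusters(pan_matrix, weights=None):
    """Multiply each column (gene cluster) by its weight.

    Assumption: weights act as column multipliers. With weights=None,
    flat weights are used, so all clusters count equally.
    """
    n_clusters = len(pan_matrix[0])
    if weights is None:
        weights = [1.0] * n_clusters
    if len(weights) != n_clusters:
        raise ValueError("weights must contain one value per column")
    return [[x * w for x, w in zip(row, weights)] for row in pan_matrix]

# Two genomes, three gene clusters.
example = [[1, 1, 0],
           [1, 0, 1]]

flat = weight_clusters(example)                          # unchanged values
downweighted = weight_clusters(example, [0.1, 1.0, 1.0])  # first cluster counts less
```
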