Calculations of module preservation statistics between independent data sets.
modulePreservation(
multiData,
multiColor,
multiWeights = NULL,
dataIsExpr = TRUE,
networkType = "unsigned",
corFnc = "cor",
corOptions = "use = 'p'",
referenceNetworks = 1,
testNetworks = NULL,
nPermutations = 100,
includekMEallInSummary = FALSE,
restrictSummaryForGeneralNetworks = TRUE,
calculateQvalue = FALSE,
randomSeed = 12345,
maxGoldModuleSize = 1000,
maxModuleSize = 1000,
quickCor = 1,
ccTupletSize = 2,
calculateCor.kIMall = FALSE,
calculateClusterCoeff = FALSE,
useInterpolation = FALSE,
checkData = TRUE,
greyName = NULL,
goldName = NULL,
savePermutedStatistics = TRUE,
loadPermutedStatistics = FALSE,
permutedStatisticsFile = if (useInterpolation) "permutedStats-intrModules.RData"
else "permutedStats-actualModules.RData",
plotInterpolation = TRUE,
interpolationPlotFile = "modulePreservationInterpolationPlots.pdf",
discardInvalidOutput = TRUE,
parallelCalculation = FALSE,
verbose = 1, indent = 0)
expression data or adjacency data in multi-set format (see checkSets). A vector of lists, one per set. Each set must contain a component data that contains the expression or adjacency data. If expression data are used, rows correspond to samples and columns to genes or probes. In case of adjacencies, each data matrix should be a symmetric matrix with entries between 0 and 1 and unit diagonal. Each component of the outermost list should be named.
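As an illustration of the multi-set format described above, the following base R sketch builds a two-set input from simulated expression matrices. The set names (setA, setB), gene names and dimensions are hypothetical and chosen only for the example.

```r
set.seed(1)
nGenes <- 20
geneNames <- paste0("gene", seq_len(nGenes))

# Hypothetical expression matrices: rows are samples, columns are genes.
exprA <- matrix(rnorm(10 * nGenes), nrow = 10, dimnames = list(NULL, geneNames))
exprB <- matrix(rnorm(8 * nGenes), nrow = 8, dimnames = list(NULL, geneNames))

# Multi-set format: a named list with one element per set,
# each element being a list that holds a `data` component.
multiData <- list(setA = list(data = exprA), setB = list(data = exprB))
str(multiData, max.level = 2)
```

The sets may have different numbers of samples; only the columns (genes) need to be matchable by name.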
a list in which every component is a vector giving the module labels of genes in multiData. The components must be named using the same names that are used in multiData; these names are used to match labels to expression data sets. See details.
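A minimal sketch of a matching label list, assuming a data list whose sets are named setA and setB (hypothetical names) and in which only setA serves as reference. The module labels themselves are made up.

```r
# Hypothetical names of the sets in the data list.
dataSetNames <- c("setA", "setB")

# Module labels for the 20 genes of the reference set;
# "grey" marks unassigned genes.
labelsA <- rep(c("blue", "brown", "grey"), times = c(8, 8, 4))

# One component per labelled set, named exactly like the corresponding data set.
multiColor <- list(setA = labelsA)

# Every label component must be matchable to a data set by name.
all(names(multiColor) %in% dataSetNames)
```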
optional weights, used only when multiData contains expression data. If given, must be in the multi-set format (see checkSets) and weights for each set must have the same dimensions as the corresponding set in multiData. The weights are used in correlation calculations that involve multiData, and are supplied as argument weights.x and possibly weights.y (where appropriate) to the correlation function specified by corFnc.
logical: if TRUE, multiData will be interpreted as expression data; if FALSE, multiData will be interpreted as adjacencies.
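When adjacencies are supplied instead of expression data, each matrix has to be symmetric with entries between 0 and 1 and a unit diagonal. The sketch below derives such a matrix from simulated data using an unsigned soft-thresholded correlation; the power 6 and the set name setA are assumptions made for the example, not requirements of the function.

```r
set.seed(2)
expr <- matrix(rnorm(12 * 6), nrow = 12,
               dimnames = list(NULL, paste0("gene", 1:6)))

# Unsigned adjacency: absolute correlation raised to a soft-thresholding power.
adj <- abs(cor(expr))^6
diag(adj) <- 1  # enforce the unit diagonal explicitly

# Sanity checks on the properties required of adjacency input.
stopifnot(isSymmetric(adj), all(adj >= 0 & adj <= 1), all(diag(adj) == 1))

# Adjacency input uses the same multi-set layout as expression input.
multiAdj <- list(setA = list(data = adj))
```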
network type. Allowed values are (unique abbreviations of) "unsigned", "signed", "signed hybrid". See adjacency.
character string specifying the function to be used to calculate co-expression similarity. Defaults to Pearson correlation. Another useful choice is bicor. More generally, any function returning values between -1 and 1 can be used.
character string specifying additional arguments to be passed to the function given by corFnc. Use "use = 'p', method = 'spearman'" to obtain Spearman correlation.
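To show what the option string amounts to, the sketch below pastes a function name and an option string into a call and evaluates it with base R's cor. How the package assembles the call internally is an assumption here; the point is only that the string is read as ordinary named arguments.

```r
set.seed(3)
x <- matrix(rnorm(50), nrow = 10)
x[1, 1] <- NA  # a missing value, handled through use = 'p'

corFnc <- "cor"
corOptions <- "use = 'p', method = 'spearman'"

# Build the call text cor(x, use = 'p', method = 'spearman') and evaluate it.
callText <- paste0(corFnc, "(x, ", corOptions, ")")
viaString <- eval(parse(text = callText))

# The same computation written out directly.
direct <- cor(x, use = "pairwise.complete.obs", method = "spearman")
isTRUE(all.equal(viaString, direct))
```

In base R, use = 'p' is a partial match for "pairwise.complete.obs".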
a vector giving the indices of expression data to be used as reference networks. Reference networks must have their module labels given in multiColor.
a list with one component per entry in referenceNetworks above, giving the test networks in which to evaluate module preservation for the corresponding reference network. If not given, preservation will be evaluated in all networks (except each reference network). If referenceNetworks is of length 1, testNetworks can also be a vector (instead of a list containing the single vector).
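The relationship between the reference and test specifications can be sketched in base R as follows. The three set names are hypothetical, and the default shown is simply a restatement of the rule above (all sets other than the reference), not code taken from the package.

```r
setNames <- c("setA", "setB", "setC")  # hypothetical sets
referenceNetworks <- c(1, 2)

# Explicit specification: one list component per reference network.
testNetworks <- list(c(2, 3), 3)

# Equivalent of leaving the test networks unspecified:
# every set except the reference itself.
defaultTests <- lapply(referenceNetworks,
                       function(ref) setdiff(seq_along(setNames), ref))

# List the reference-test pairs implied by the explicit specification.
for (i in seq_along(referenceNetworks)) {
  cat(setNames[referenceNetworks[i]], "->",
      paste(setNames[testNetworks[[i]]], collapse = ", "), "\n")
}
```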
specifies the number of permutations that will be calculated in the permutation test.
logical: should cor.kMEall be included in the calculated summary statistics? Because kMEall takes into account all genes in the network, this statistic measures preservation of the full network with respect to the eigengene of the module. This may be undesirable, hence the default is FALSE.
logical: should the summary statistics for general (not correlation) networks be restricted (density to meanAdj, connectivity to cor.kIM and cor.Adj)? The default TRUE corresponds to published work.
logical: should q-values (local FDR estimates) be calculated? Package qvalue must be installed for this calculation. Note that q-values may not be meaningful when the number of modules is small and/or most modules are preserved.
seed for the random number generator. If NULL, the seed will not be set. If non-NULL and the random generator has been initialized prior to the function call, the generator's state is saved and restored upon exit.
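The save-and-restore behaviour described above can be imitated in base R. The helper withLocalSeed below is hypothetical and written for this illustration only; it is not part of the package.

```r
# Hypothetical helper: evaluate `expr` under a fixed seed and put the
# caller's generator state back afterwards.
withLocalSeed <- function(seed, expr) {
  if (!is.null(seed)) {
    if (exists(".Random.seed", envir = globalenv(), inherits = FALSE)) {
      saved <- get(".Random.seed", envir = globalenv())
      on.exit(assign(".Random.seed", saved, envir = globalenv()), add = TRUE)
    }
    set.seed(seed)
  }
  expr  # lazily evaluated here, i.e. after the seed has been set
}

set.seed(42)  # initialize the generator before the call
stateBefore <- get(".Random.seed", envir = globalenv())

draw1 <- withLocalSeed(12345, runif(3))
draw2 <- withLocalSeed(12345, runif(3))
stateAfter <- get(".Random.seed", envir = globalenv())

identical(draw1, draw2)             # same seed, same draws
identical(stateBefore, stateAfter)  # caller's state is untouched
```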
maximum size of the "gold" module, i.e., the random sample of all network genes.
maximum module size used for calculations. Modules larger than maxModuleSize will be reduced by randomly sampling maxModuleSize genes.
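A rough base R sketch of the size reduction described above, using made-up labels and a deliberately small limit. It illustrates the idea of random down-sampling only and is not the package's own code.

```r
set.seed(4)
maxModuleSize <- 5

# Hypothetical labels: a module of 12 genes and a module of 3 genes.
labels <- rep(c("blue", "brown"), times = c(12, 3))
names(labels) <- paste0("gene", seq_along(labels))

# For each module, keep all genes if it is small enough,
# otherwise keep a random sample of maxModuleSize genes.
keep <- unlist(lapply(split(names(labels), labels), function(genes) {
  if (length(genes) > maxModuleSize) sample(genes, maxModuleSize) else genes
}), use.names = FALSE)

reduced <- labels[names(labels) %in% keep]
table(reduced)
```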
number between 0 and 1 specifying the handling of missing data in calculation of correlation. Zero means exact but potentially slower calculations; one means potentially faster calculations, but with potentially inaccurate results if the proportion of missing data is large. See cor for more details.
tuplet size for co-clustering calculations.
logical: should cor.kIMall be calculated? This option is only valid for adjacency input. If FALSE, cor.kIMall will not be calculated, potentially saving a significant amount of time if the input adjacencies are large and contain many modules.
logical: should statistics based on the clustering coefficient be calculated? While these statistics may be interesting, the calculations are also computationally expensive.
logical: should data be checked for excessive number of missing entries? See goodSamplesGenesMS for details.
label used for unassigned genes. Traditionally such genes are labeled by grey color or numeric label 0. These values are the default when multiColor contains character or numeric vectors, respectively.
label used for the "module" representing a random sample of the whole network. Traditionally such genes are labeled by gold color or numeric label 0.1. These values are the default when greyName is character and numeric, respectively. If these values conflict with the module labels in multiColor, they should be set to something not present in multiColor.
logical: should calculated permutation statistics be saved? Saved statistics may be re-used if the calculation needs to be repeated.
file name to save the permutation statistics into.
logical: should permutation statistics be loaded? If a previously executed calculation needs to be repeated, loading permutation study results can cut the calculation time many-fold.
logical: should permutation statistics be calculated by interpolating an artificial set of evenly spaced modules? This option may potentially speed up the calculations, but it restricts calculations to density measures.
logical: should interpolation plots be saved? If interpolation is used (see useInterpolation above), the function can optionally generate diagnostic plots that can be used to assess whether the interpolation makes sense.
file name to save the interpolation plots into.
logical: should output columns containing no valid data be discarded? This option may be useful when input dataIsExpr is FALSE and some of the output statistics cannot be calculated. This option causes such statistics to be dropped from output.
logical: should calculations be done in parallel? Note that parallel calculations are turned off by default and will lead to somewhat different results than serial calculations because the random seed is set differently. For the calculation to actually run in parallel mode, a call to enableWGCNAThreads must be made before this function is called.
integer level of verbosity. Zero means silent, higher values make the output progressively more and more verbose.
indentation for diagnostic messages. Zero means no indentation, each unit adds two spaces.
The function returns a nested list of preservation statistics. At the top level, the list components are:
observed values, Z scores, log p-values, Bonferroni-corrected log p-values, and (optionally) q-values of quality statistics. All logarithms are in base 10.
observed values, Z scores, log p-values, Bonferroni-corrected log p-values, and (optionally) q-values of density and connectivity preservation statistics. All logarithms are in base 10.
observed values, Z scores, log p-values, Bonferroni-corrected log p-values, and (optionally) q-values of cross-tabulation statistics. All logarithms are in base 10.
observed values, Z scores, log p-values, Bonferroni-corrected log p-values, and (optionally) q-values of module separability in the reference network. All logarithms are in base 10.
observed values, Z scores, log p-values, Bonferroni-corrected log p-values, and (optionally) q-values of module separability in the test network. All logarithms are in base 10.
results of individual permutations, useful for diagnostics
Each of the lists observed, Z, log.p, log.pBonf, optionally q, observedOverlapCounts and observedFisherPvalues is structured as a 2-level list where the outer components correspond to reference sets and the inner components to test sets.
As an example, preservation$observed[[1]][[2]] contains the density and connectivity preservation statistics for the preservation of set 1 modules in set 2, that is, set 1 is the reference set and set 2 is the test set. preservation$observed[[1]][[2]] is a data frame in which each row corresponds to a module in the reference network 1, plus one row for the unassigned objects and one row for a "module" that contains randomly sampled objects and that represents a whole-network average. Each column corresponds to a statistic as indicated by the column name.
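The two-level indexing can be tried out on a mock object that only mimics the nesting described above. All values, the set names, the column name stat1 and the NA placeholder for the reference set itself are invented for this sketch; a real result contains the statistics named in its own column headers.

```r
# Mock data frame standing in for one reference-test table (made-up values).
mockStats <- data.frame(moduleSize = c(30, 25), stat1 = c(0.8, 0.3),
                        row.names = c("blue", "brown"))

# Outer level: reference sets. Inner level: test sets.
preservation <- list(observed = list(
  setA = list(setA = NA, setB = mockStats)
))

ref <- 1
test <- 2
obs <- preservation$observed[[ref]][[test]]

# One row per module of the reference set, one column per statistic.
obs["blue", "stat1"]
```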
This function calculates module preservation statistics pair-wise between given reference sets and all other sets in multiData. Reference sets must have their corresponding module assignment specified in multiColor; module assignment is optional for test sets. Individual expression sets and their module labels are matched using names of the corresponding components in multiData and multiColor.
For each reference-test pair, the function calculates module preservation statistics that
measure how well the modules of the reference set are preserved in the test set.
If multiColor also contains module assignment for the test set, the calculated statistics also include cross-tabulation statistics that make use of the test module assignment.
For each reference-test pair, the function only uses genes (columns of the data component of each component of multiData) that are in common between the reference and test set. Columns are matched by column names, so column names must be valid.
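The restriction to shared genes can be pictured with a small base R sketch on two simulated matrices whose gene sets overlap only partly. The gene names are hypothetical and the code illustrates name-based matching in general, not the package internals.

```r
set.seed(5)
# Reference set has gene1..gene8, test set has gene4..gene9.
refData <- matrix(rnorm(40), nrow = 5,
                  dimnames = list(NULL, paste0("gene", 1:8)))
testData <- matrix(rnorm(36), nrow = 6,
                   dimnames = list(NULL, paste0("gene", 4:9)))

# Genes present in both sets, matched by column name.
common <- intersect(colnames(refData), colnames(testData))

# Restrict both sets to the common genes, in the same column order.
refCommon <- refData[, common, drop = FALSE]
testCommon <- testData[, common, drop = FALSE]
common
```

Missing or duplicated column names would make this matching ambiguous, which is why valid column names are required.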
In addition to preservation statistics, the function also calculates several statistics of module quality, that is, measures of how well-defined modules are in the reference set. The quality statistics are calculated with respect to genes in common with a test set; thus the function calculates a set of quality statistics for each reference-test pair. This may be somewhat counter-intuitive, but it allows a direct comparison of corresponding quality and preservation statistics.
The calculated p-values are determined from the Z scores of individual measures under assumption of normality. No p-value is calculated for the Zsummary measures. Bonferroni correction is applied for the number of tested modules. Because the p-values for strongly preserved modules are often extremely low, the function reports base 10 logarithms of the p-values. However, q-values are reported untransformed since they are calculated that way in package qvalue.
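The arithmetic behind such log p-values can be sketched in base R. The Z scores below are made up, and the use of a one-sided upper-tail p-value and a cap at p = 1 after correction are assumptions of this sketch rather than a statement of the package's exact procedure.

```r
# Hypothetical Z scores for three modules.
Z <- c(blue = 12.4, brown = 1.5, green = 45)
nModules <- length(Z)

# For very large Z the plain p-value underflows to zero ...
pnorm(45, lower.tail = FALSE)

# ... so the p-value is computed on the log scale and converted to base 10.
log10p <- pnorm(Z, lower.tail = FALSE, log.p = TRUE) / log(10)

# Bonferroni: multiplying p by the number of modules adds log10(nModules);
# the result is capped at log10(1) = 0.
log10pBonf <- pmin(log10p + log10(nModules), 0)

cbind(log10p, log10pBonf)
```

Working on the log scale keeps strongly preserved modules distinguishable even when their untransformed p-values would all be reported as zero.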
Missing data are removed (but see quickCor above).
Peter Langfelder, Rui Luo, Michael C. Oldham, and Steve Horvath, to appear
Network construction and module detection functions in the WGCNA package such as adjacency, blockwiseModules; rudimentary cleaning in goodSamplesGenesMS; the WGCNA implementation of correlation in cor.