modulePreservation: Calculation of module preservation statistics

Description

Calculations of module preservation statistics between independent data sets.

Usage

modulePreservation(
   multiData,
   multiColor,
   multiWeights = NULL,
   dataIsExpr = TRUE,
   networkType = "unsigned", 
   corFnc = "cor",
   corOptions = "use = 'p'",
   referenceNetworks = 1, 
   testNetworks = NULL,
   nPermutations = 100, 
   includekMEallInSummary = FALSE,
   restrictSummaryForGeneralNetworks = TRUE,
   calculateQvalue = FALSE,
   randomSeed = 12345, 
   maxGoldModuleSize = 1000, 
   maxModuleSize = 1000, 
   quickCor = 1, 
   ccTupletSize = 2, 
   calculateCor.kIMall = FALSE,
   calculateClusterCoeff = FALSE,
   useInterpolation = FALSE, 
   checkData = TRUE, 
   greyName = NULL, 
   goldName = NULL,
   savePermutedStatistics = TRUE, 
   loadPermutedStatistics = FALSE, 
   permutedStatisticsFile = if (useInterpolation) "permutedStats-intrModules.RData" 
                                   else "permutedStats-actualModules.RData", 
   plotInterpolation = TRUE, 
   interpolationPlotFile = "modulePreservationInterpolationPlots.pdf", 
   discardInvalidOutput = TRUE,
   parallelCalculation = FALSE,
   verbose = 1, indent = 0)

Value

The function returns a nested list of preservation statistics. At the top level, the list components are:

quality: observed values, Z scores, log p-values, Bonferoni-corrected log p-values, and (optionally) q-values of quality statistics. All logarithms are in base 10.
preservation: observed values, Z scores, log p-values, Bonferoni-corrected log p-values, and (optionally) q-values of density and connectivity preservation statistics. All logarithms are in base 10.
accuracy: observed values, Z scores, log p-values, Bonferoni-corrected log p-values, and (optionally) q-values of cross-tabulation statistics. All logarithms are in base 10.
referenceSeparability: observed values, Z scores, log p-values, Bonferoni-corrected log p-values, and (optionally) q-values of module separability in the reference network. All logarithms are in base 10.
testSeparability: observed values, Z scores, p-values, Bonferoni-corrected p-values, and (optionally) q-values of module separability in the test network. All logarithms are in base 10.
permutationDetails: results of individual permutations, useful for diagnostics

All of the above are lists. The lists quality, preservation, referenceSeparability, and testSeparability each contain 4 or 5 components: observed contains observed values, Z contains the corresponding Z scores, log.p contains base 10 logarithms of the p-values, log.pBonf contains base 10 logarithms of the Bonferoni corrected p-values, and optionally q

contains the associated q-values. The list accuracy contains observed, Z, log.p, log.pBonf, optionally q, and additional components observedOverlapCounts and observedFisherPvalues that contain the observed matrices of overlap counts and Fisher test p-values.

Each of the lists observed, Z, log.p, log.pBonf, optionally q, observedOverlapCounts and observedFisherPvalues

is structured as a 2-level list where the outer components correspond to reference sets and the inner components to tests sets. As an example, preservation$observed[[1]][[2]] contains the density and connectivity preservation statistics for the preservation of set 1 modules in set 2, that is set 1 is the reference set and set 2 is the test set. preservation$observed[[1]][[2]] is a data frame in which each row corresponds to a module in the reference network 1 plus one row for the unassigned objects, and one row for a "module" that contains randomly sampled objects and that represents a whole-network average. Each column corresponds to a statistic as indicated by the column name.

Arguments

multiData: expression data or adjacency data in multi-set format (see checkSets). A vector of lists, one per set. Each set must contain a component data that contains the expression or adjacency data. If expression data are used, rows correspond to samples and columns to genes or probes. In case of adjacencies, each data matrix should be a symmetric matrix ith entries between 0 and 1 and unit diagonal. Each component of the outermost list should be named.
multiColor: a list in which every component is a vector giving the module labels of genes in multiExpr. The components must be named using the same names that are used in multiExpr; these names are used top match labels to expression data sets. See details.
multiWeights: optional weights, only when multiData contains expression data. If given, must be in the multi-set format (see checkSets) and weights for each set must have the same dimensions as the corresponding set in multiData. The weights are used in correlation calculations that involve multiData, and are supplied as argument weights.x and possibly weights.y (where appropriate) to the correlation function specified by corFnc.
dataIsExpr: logical: if TRUE, multiData will be interpreted as expression data; if FALSE, multiData will be interpreted as adjacencies.
networkType: network type. Allowed values are (unique abbreviations of) "unsigned", "signed", "signed hybrid". See adjacency.
corFnc: character string specifying the function to be used to calculate co-expression similarity. Defaults to Pearson correlation. Another useful choice is bicor. More generally, any function returning values between -1 and 1 can be used.
corOptions: character string specifying additional arguments to be passed to the function given by corFnc. Use "use = 'p', method = 'spearman'" to obtain Spearman correlation.
referenceNetworks: a vector giving the indices of expression data to be used as reference networks. Reference networks must have their module labels given in multiColor.
testNetworks: a list with one component per each entry in referenceNetworks above, giving the test networks in which to evaluate module preservation for the corresponding reference network. If not given, preservation will be evaluated in all networks (except each reference network). If referenceNetworks is of length 1, testNetworks can also be a vector (instead of a list containing the single vector).
nPermutations: specifies the number of permutations that will be calculated in the permutation test.
includekMEallInSummary: logical: should cor.kMEall be included in the calculated summary statistics? Because kMEall takes into account all genes in the network, this statistic measures preservation of the full network with respect to the eigengene of the module. This may be undesirable, hence the default is FALSE.
restrictSummaryForGeneralNetworks: logical: should the summary statistics for general (not correlation) networks be restricted (density to meanAdj, connectivity to cor.kIM and cor.Adj)? The default TRUE corresponds to published work.
calculateQvalue: logical: should q-values (local FDR estimates) be calculated? Package qvalue must be installed for this calculation. Note that q-values may not be meaningful when the number of modules is small and/or most modules are preserved.
randomSeed: seed for the random number generator. If NULL, the seed will not be set. If non-NULL and the random generator has been initialized prior to the function call, the latter's state is saved and restored upon exit
maxGoldModuleSize: maximum size of the "gold" module, i.e., the random sample of all network genes.
maxModuleSize: maximum module size used for calculations. Modules larger than maxModuleSize will be reduced by randomly sampling maxModuleSize genes.
quickCor: number between 0 and 1 specifying the handling of missing data in calculation of correlation. Zero means exact but potentially slower calculations; one means potentially faster calculations, but with potentially inaccurate results if the proportion of missing data is large. See cor for more details.
ccTupletSize: tuplet size for co-clustering calculations.
calculateCor.kIMall: logical: should cor.kMEall be calculated? This option is only valid for adjacency input. If FALSE, cor.kIMall will not be calculated, potentially saving significant amount of time if the input adjacencies are large and contain many modules.
calculateClusterCoeff: logical: should statistics based on the clustering coefficient be calculated? While these statistics may be interesting, the calculations are also computationally expensive.
checkData: logical: should data be checked for excessive number of missing entries? See goodSamplesGenesMS for details.
greyName: label used for unassigned genes. Traditionally such genes are labeled by grey color or numeric label 0. These values are the default when multiColor contains character or numeric vectors, respectively.
goldName: label used for the "module" representing a random sample of the whole network. Traditionally such genes are labeled by gold color or numeric label 0.1. These values are the default when greyName is character and numeric, respectively. If these values conflict with the module labels in multiColor, they should be set to something not present in multiColor.
savePermutedStatistics: logical: should calculated permutation statistics be saved? Saved statistics may be re-used if the calculation needs to be repeated.
permutedStatisticsFile: file name to save the permutation statistics into.
loadPermutedStatistics: logical: should permutation statistics be loaded? If a previously executed calculation needs to be repeated, loading permutation study results can cut the calculation time many-fold.
useInterpolation: logical: should permutation statistics be calculated by interpolating an artificial set of evenly spaced modules? This option may potentially speed up the calculations, but it restricts calculations to density measures.
plotInterpolation: logical: should interpolation plots be saved? If interpolation is used (see useInterpolation above), the function can optionally generate diagnostic plots that can be used to assess whether the interpolation makes sense.
interpolationPlotFile: file name to save the interpolation plots into.
discardInvalidOutput: logical: should output columns containing no valid data be discarded? This option may be useful when input dataIsExpr is FALSE and some of the output statistics cannot be calculated. This option causes such statistics to be dropped from output.
parallelCalculation: logical: should calculations be done in parallel? Note that parallel calculations are turned off by default and will lead to somewhat DIFFERENT results than serial calculations because the random seed is set differently. For the calculation to actually run in parallel mode, a call to enableWGCNAThreads must be made before this function is called.
verbose: integer level of verbosity. Zero means silent, higher values make the output progressively more and more verbose.
indent: indentation for diagnostic messages. Zero means no indentation, each unit adds two spaces.

Author

Rui Luo and Peter Langfelder

Details

This function calculates module preservation statistics pair-wise between given reference sets and all other sets in multiExpr. Reference sets must have their corresponding module assignment specified in multiColor; module assignment is optional for test sets. Individual expression sets and their module labels are matched using names of the corresponding components in multiExpr and multiColor.

For each reference-test pair, the function calculates module preservation statistics that measure how well the modules of the reference set are preserved in the test set. If the multiColor also contains module assignment for the test set, the calculated statistics also include cross-tabulation statistics that make use of the test module assignment.

For each reference-test pair, the function only uses genes (columns of the data component of each component of multiExpr) that are in common between the reference and test set. Columns are matched by column names, so column names must be valid.

In addition to preservation statistics, the function also calculates several statistics of module quality, that is measures of how well-defined modules are in the reference set. The quality statistics are calculated with respect to genes in common with with a test set; thus the function calculates a set of quality statistics for each reference-test pair. This may be somewhat counter-intuitive, but it allows a direct comparison of corresponding quality and preservation statistics.

The calculated p-values are determined from the Z scores of individual measures under assumption of normality. No p-value is calculated for the Zsummary measures. Bonferoni correction to the number of tested modules. Because the p-values for strongly preserved modules are often extremely low, the function reports natural logarithms (base e) of the p-values. However, q-values are reported untransformed since they are calculated that way in package qvalue.

Missing data are removed (but see quickCor above).

References

Peter Langfelder, Rui Luo, Michael C. Oldham, and Steve Horvath, to appear