hierarchicalConsensusModules: Hierarchical consensus network construction and module identification

Description

Hierarchical consensus network construction and module identification across multiple data sets.

Usage

hierarchicalConsensusModules(
   multiExpr, 
   multiWeights = NULL,
   multiExpr.imputed = NULL,
   # Data checking options
   checkMissingData = TRUE,
   # Blocking options
   blocks = NULL, 
   maxBlockSize = 5000, 
   blockSizePenaltyPower = 5,
   nPreclusteringCenters = NULL,
   randomSeed = 12345,
   # Network construction options. 
   networkOptions,
   # Save individual TOMs?
   saveIndividualTOMs = TRUE,
   individualTOMFileNames = "individualTOM-Set%s-Block%b.RData",
   keepIndividualTOMs = FALSE,
   # Consensus calculation options
   consensusTree = NULL,  
   # Return options
   saveConsensusTOM = TRUE,
   consensusTOMFilePattern = "consensusTOM-%a-Block%b.RData",
   # Keep the consensus? 
   keepConsensusTOM = saveConsensusTOM,
   # Internal handling of TOMs
   useDiskCache = NULL, chunkSize = NULL,
   cacheBase = ".blockConsModsCache",
   cacheDir = ".",
   # Alternative consensus TOM input from a previous calculation 
   consensusTOMInfo = NULL,
   # Basic tree cut options 
   deepSplit = 2, 
   detectCutHeight = 0.995, minModuleSize = 20,
   checkMinModuleSize = TRUE,
   # Advanced tree cut options
   maxCoreScatter = NULL, minGap = NULL,
   maxAbsCoreScatter = NULL, minAbsGap = NULL,
   minSplitHeight = NULL, minAbsSplitHeight = NULL,
   useBranchEigennodeDissim = FALSE,
   minBranchEigennodeDissim = mergeCutHeight,
   stabilityLabels = NULL,
   stabilityCriterion = c("Individual fraction", "Common fraction"),
   minStabilityDissim = NULL,
   pamStage = TRUE,  pamRespectsDendro = TRUE,
   iteratePruningAndMerging = FALSE,
   minCoreKME = 0.5, minCoreKMESize = minModuleSize/3,
   minKMEtoStay = 0.2,
   # Module eigengene calculation options
   impute = TRUE,
   trapErrors = FALSE,
   excludeGrey = FALSE,
   # Module merging options
   calibrateMergingSimilarities = FALSE,
   mergeCutHeight = 0.15, 
                    
   # General options
   collectGarbage = TRUE,
   verbose = 2, indent = 0,
   ...)

Value

List with the following components:

labels: A numeric vector with one component per variable (gene), giving the module label of each variable (gene). Label 0 is reserved for unassigned variables; module labels are sequential and smaller numbers are used for larger modules.
unmergedLabels: A numeric vector with one component per variable (gene), giving the unmerged module label of each variable (gene), i.e., module labels before the call to module merging.
colors: A character vector with one component per variable (gene), giving the module colors. The labels are mapped to colors using labels2colors.
unmergedColors: A character vector with one component per variable (gene), giving the unmerged module colors.
multiMEs: Module eigengenes corresponding to the modules returned in colors, in multi-set format. A vector of lists, one per set, containing eigengenes, proportion of variance explained and other information. See multiSetMEs for a detailed description.
dendrograms: A list with one component for each block of genes. Each component is the hierarchical clustering dendrogram obtained by clustering the consensus gene dissimilarity in the corresponding block.
consensusTOMInfo: A list detailing various aspects of the consensus TOM. See hierarchicalConsensusTOM for details.
blockInfo: A list with information about blocks as well as the variables and observations (genes and samples) retained after filtering out those with zero variance and too many missing values.
moduleIdentificationArguments: A list with the module identification arguments supplied to this function. Contains deepSplit, detectCutHeight, minModuleSize, maxCoreScatter, minGap, maxAbsCoreScatter, minAbsGap, minSplitHeight, useBranchEigennodeDissim, minBranchEigennodeDissim, minStabilityDissim, pamStage, pamRespectsDendro, minCoreKME, minCoreKMESize, minKMEtoStay, calibrateMergingSimilarities, and mergeCutHeight.

Arguments

multiExpr: Expression data in the multi-set format (see checkSets). A vector of lists, one per set. Each set must contain a component data that contains the expression data, with rows corresponding to samples and columns to genes or probes.
multiWeights: optional observation weights in the same format (and dimensions) as multiExpr. These weights are used for correlation calculations with data in multiExpr.
multiExpr.imputed: If multiExpr contains missing data, this argument can be used to supply the expression data with missing data imputed. If not given, the impute.knn function will be used to impute the missing data.
checkMissingData: Logical: should data be checked for excessive numbers of missing entries in genes and samples, and for genes with zero variance? See details.
blocks: Optional specification of blocks in which hierarchical clustering and module detection should be performed. If given, must be a numeric vector with one entry per gene of multiExpr giving the number of the block to which the corresponding gene belongs.
maxBlockSize: Integer giving maximum block size for module detection. Ignored if blocks above is non-NULL. Otherwise, if the number of genes in multiExpr exceeds maxBlockSize, genes will be pre-clustered into blocks whose size should not exceed maxBlockSize.
blockSizePenaltyPower: Number specifying how strongly blocks should be penalized for exceeding the maximum size. Set to a large number or Inf if not exceeding maximum block size is very important.
nPreclusteringCenters: Number of centers to be used in the preclustering. Defaults to the smaller of nGenes/20 and 100*nGenes/maxBlockSize, where nGenes is the number of genes (variables) in multiExpr.
randomSeed: Integer to be used as seed for the random number generator before the function starts. If a current seed exists, it is saved and restored upon exit. If NULL is given, the function will not save and restore the seed.
networkOptions: A single list of class NetworkOptions giving options for network calculation for all of the networks, or a multiData structure containing one such list for each input data set.
saveIndividualTOMs: Logical: should individual TOMs be saved to disk (TRUE) or returned directly in the return value (FALSE)?
individualTOMFileNames: Character string giving the file names to save individual TOMs into. The following tags should be used to make the file names unique for each set and block: %s will be replaced by the set number; %N will be replaced by the set name (taken from names(multiExpr)) if it exists, otherwise by set number; %b will be replaced by the block number. If the file names turn out to be non-unique, an error will be generated.
keepIndividualTOMs: Logical: should individual TOMs be retained after the calculation is finished?
consensusTree: A list specifying the consensus calculation. See details.
saveConsensusTOM: Logical: should the consensus TOM be saved to disk?
consensusTOMFilePattern: Character string giving the file names to save consensus TOMs into. The following tags should be used to make the file names unique for each set and block: %s will be replaced by the set number; %N will be replaced by the set name (taken from names(multiExpr)) if it exists, otherwise by set number; %b will be replaced by the block number. If the file names turn out to be non-unique, an error will be generated.
keepConsensusTOM: Logical: should consensus TOM be retained after the calculation ends? Depending on saveConsensusTOM, the retained TOM is either saved to disk or returned within the return value.
useDiskCache: Logical: should disk cache be used for consensus calculations? The disk cache can be used to store chunks of calibrated data that are small enough to fit one chunk from each set into memory (blocks may be small enough to fit one block of one set into memory, but not small enough to fit one block from all sets in a consensus calculation into memory at the same time). Using disk cache is slower but lessens the memory footprint of the calculation. As a general guide, if individual data are split into blocks, we recommend setting this argument to TRUE. If this argument is NULL, the function will decide whether to use disk cache based on the number of sets and block sizes.
chunkSize: Integer giving the chunk size. If left NULL, a suitable size will be chosen automatically.
cacheDir: Directory in which to save cache files. The files are deleted on normal exit but persist if the function terminates abnormally.
cacheBase: Base for the file names of cache files.
consensusTOMInfo: If the consensus TOM has been pre-calculated using function hierarchicalConsensusTOM, this argument can be used to supply it. If given, the consensus TOM calculation options above are ignored.
deepSplit: Numeric value between 0 and 4. Provides a simplified control over how sensitive module detection should be to module splitting, with 0 least and 4 most sensitive. See cutreeDynamic for more details.
detectCutHeight: Dendrogram cut height for module detection. See cutreeDynamic for more details.
minModuleSize: Minimum module size for module detection. See cutreeDynamic for more details.
checkMinModuleSize: logical: should sanity checks be performed on minModuleSize?
maxCoreScatter: maximum scatter of the core for a branch to be a cluster, given as the fraction of cutHeight relative to the 5th percentile of joining heights. See cutreeDynamic for more details.
minGap: minimum cluster gap given as the fraction of the difference between cutHeight and the 5th percentile of joining heights. See cutreeDynamic for more details.
maxAbsCoreScatter: maximum scatter of the core for a branch to be a cluster given as absolute heights. If given, overrides maxCoreScatter. See cutreeDynamic for more details.
minAbsGap: minimum cluster gap given as absolute height difference. If given, overrides minGap. See cutreeDynamic for more details.
minSplitHeight: Minimum split height given as the fraction of the difference between cutHeight and the 5th percentile of joining heights. Branches merging below this height will automatically be merged. Defaults to zero but is used only if minAbsSplitHeight below is NULL.
minAbsSplitHeight: Minimum split height given as an absolute height. Branches merging below this height will automatically be merged. If not given (default), will be determined from minSplitHeight above.
useBranchEigennodeDissim: Logical: should branch eigennode (eigengene) dissimilarity be considered when merging branches in Dynamic Tree Cut?
minBranchEigennodeDissim: Minimum consensus branch eigennode (eigengene) dissimilarity for branches to be considered separate. The branch eigennode dissimilarity in individual sets is simply 1-correlation of the eigennodes; the consensus is defined as the quantile with probability consensusQuantile.
stabilityLabels: Optional matrix of cluster labels that are to be used for calculating branch dissimilarity based on split stability. The number of rows must equal the number of genes in multiExpr; the number of columns (clusterings) is arbitrary. See branchSplitFromStabilityLabels for details.
stabilityCriterion: One of c("Individual fraction", "Common fraction"), indicating which method for assessing stability similarity of two branches should be used. We recommend "Individual fraction" which appears to perform better; the "Common fraction" method is provided for backward compatibility since it was the (only) method available prior to WGCNA version 1.60.
minStabilityDissim: Minimum stability dissimilarity criterion for two branches to be considered separate. Should be a number between 0 (essentially no dissimilarity required) and 1 (perfect dissimilarity or distinguishability based on stabilityLabels). See branchSplitFromStabilityLabels for details.
pamStage: logical. If TRUE, the second (PAM-like) stage of module detection will be performed. See cutreeDynamic for more details.
pamRespectsDendro: Logical, only used when pamStage is TRUE. If TRUE, the PAM stage will respect the dendrogram in the sense that an object can be PAM-assigned only to clusters that lie below it on the branch that the object is merged into. See cutreeDynamic for more details.
iteratePruningAndMerging: Logical: should pruning of low-KME genes and module merging be iterated? For backward compatibility, the default is FALSE, but setting it to TRUE may lead to better-defined modules.
minCoreKME: a number between 0 and 1. If a detected module does not have at least minCoreKMESize genes with eigengene connectivity at least minCoreKME, the module is disbanded (its genes are unlabeled and returned to the pool of genes waiting for module detection).
minCoreKMESize: see minCoreKME above.
minKMEtoStay: genes whose eigengene connectivity to their module eigengene is lower than minKMEtoStay are removed from the module.
impute: logical: should imputation be used for module eigengene calculation? See moduleEigengenes for more details.
trapErrors: logical: should errors in calculations be trapped?
excludeGrey: logical: should the returned module eigengenes exclude the eigengene of the "module" that contains unassigned genes?
calibrateMergingSimilarities: Logical: should module eigengene similarities be calibrated before calculating the consensus? Although calibration is in principle desirable, the calibration methods currently available assume large data and do not work very well on eigengene similarities.
mergeCutHeight: Dendrogram cut height for module merging.
collectGarbage: Logical: should garbage be collected after some of the memory-intensive steps?
verbose: integer level of verbosity. Zero means silent, higher values make the output progressively more and more verbose.
indent: indentation for diagnostic messages. Zero means no indentation, each unit adds two spaces.
...: Other arguments. Currently ignored.
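
As an illustration of how the file name tags described for individualTOMFileNames expand, the following base R sketch substitutes %s, %N and %b for each set and block and checks that the resulting names are unique. The helper expandTOMFileName is hypothetical and is not part of WGCNA; the package performs its own substitution internally, so this only mirrors the documented behavior.

```r
# Hypothetical helper (NOT a WGCNA function): expand the documented tags.
expandTOMFileName <- function(pattern, set, block, setNames = NULL) {
  # %N falls back to the set number when no set names are available
  setName <- if (is.null(setNames)) as.character(set) else setNames[set]
  out <- gsub("%s", as.character(set), pattern, fixed = TRUE)
  out <- gsub("%N", setName, out, fixed = TRUE)
  gsub("%b", as.character(block), out, fixed = TRUE)
}

pattern <- "individualTOM-Set%s-Block%b.RData"

# All combinations of 2 sets and 3 blocks (toy sizes chosen for illustration)
grid <- expand.grid(set = 1:2, block = 1:3)
fileNames <- mapply(expandTOMFileName, set = grid$set, block = grid$block,
                    MoreArgs = list(pattern = pattern))

# Mirror the documented requirement that file names be unique
if (anyDuplicated(fileNames)) stop("File names are not unique.")
print(fileNames)
```

A pattern that omits %b (or both %s and %N) would produce duplicates here, which is the situation in which the documentation says an error is generated.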

Author

Peter Langfelder

Details

This function calculates a consensus network with a flexible, possibly hierarchical consensus specification, identifies (consensus) modules in the network, and calculates their eigengenes. "Blockwise" calculation is available for large data sets for which a full network (TOM or adjacency matrix) would not fit into available RAM.

The input can be either several numerical data sets (expression etc) in the argument multiExpr together with all necessary network construction options, or a pre-calculated network, typically the result of a call to hierarchicalConsensusTOM.

Steps in the network construction include the following: (1) optional filtering of variables (genes) and observations (samples) that contain too many missing values or have zero variance; (2) optional pre-clustering to split data into blocks of manageable size; (3) calculation of adjacencies and optionally of TOMs in each individual data set; (4) calculation of consensus network from the individual networks; (5) hierarchical clustering and module identification; (6) trimming of modules by removing genes with low correlation with the eigengene of the module; and (7) merging of modules whose eigengenes are strongly correlated.

Steps 1-4 (up to and including the calculation of consensus network from the individual networks) are handled by the function hierarchicalConsensusTOM.
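
The consensus calculation in step 4 can be pictured as an element-wise quantile taken across the individual networks. The sketch below is a simplification under stated assumptions: it uses random toy similarity matrices, skips the calibration and the hierarchical nesting that hierarchicalConsensusTOM supports, and the helpers consensusByQuantile and makeSim are hypothetical names, not WGCNA functions.

```r
# Hypothetical helper (NOT a WGCNA function): element-wise quantile across sets.
consensusByQuantile <- function(matrices, consensusQuantile = 0) {
  stack <- simplify2array(matrices)  # genes x genes x sets array
  apply(stack, c(1, 2), quantile, probs = consensusQuantile, names = FALSE)
}

# Toy symmetric similarity matrices standing in for individual TOMs
set.seed(1)
makeSim <- function(n) {
  m <- matrix(runif(n * n), n, n)
  m <- (m + t(m)) / 2
  diag(m) <- 1
  m
}
sims <- list(makeSim(5), makeSim(5), makeSim(5))

# Quantile 0 is the element-wise minimum, the most conservative consensus
consMin <- consensusByQuantile(sims, consensusQuantile = 0)
# Quantile 0.5 is the element-wise median, a more permissive consensus
consMedian <- consensusByQuantile(sims, consensusQuantile = 0.5)
```

With a quantile of 0, a pair of genes is strongly connected in the consensus only if it is strongly connected in every input set; larger quantiles relax that requirement.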

Variables (genes) are clustered using average-linkage hierarchical clustering and modules are identified in the resulting dendrogram by the Dynamic Hybrid tree cut.
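
The clustering step can be sketched in base R as follows. This is a minimal illustration, assuming a toy similarity matrix in place of the consensus TOM; the Dynamic Hybrid tree cut itself (cutreeDynamic) is not called here, and a static cutree is used purely as a stand-in to show that the dendrogram separates the two simulated groups.

```r
set.seed(42)
nSamples <- 30

# Two groups of five toy "genes", each following its own latent factor
f1 <- rnorm(nSamples)
f2 <- rnorm(nSamples)
expr <- cbind(sapply(1:5, function(i) f1 + rnorm(nSamples, sd = 0.3)),
              sapply(1:5, function(i) f2 + rnorm(nSamples, sd = 0.3)))
colnames(expr) <- paste0("gene", 1:10)

# Assumption: absolute correlation stands in for the consensus TOM
sim <- abs(cor(expr))
dissim <- 1 - sim

# Average-linkage hierarchical clustering of the dissimilarity
tree <- hclust(as.dist(dissim), method = "average")

# Stand-in for the Dynamic Hybrid tree cut: a static cut into two clusters
labels <- cutree(tree, k = 2)
```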

Found modules are trimmed of genes whose consensus module membership kME (that is, correlation with module eigengene) is less than minKMEtoStay. Modules in which fewer than minCoreKMESize genes have consensus KME higher than minCoreKME are disbanded, i.e., their constituent genes are pronounced unassigned.
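
The trimming and disbanding rules can be illustrated on a single toy data set. The sketch below makes several assumptions: it works on one set rather than a consensus across sets, it approximates the module eigengene by the first left singular vector of the scaled data, the helper trimModule is hypothetical, and minCoreKMESize is set to 3 only because the toy module is small.

```r
set.seed(7)
nSamples <- 40
driver <- rnorm(nSamples)

# Toy module: 8 genes follow the driver, gene9 runs against it, gene10 is noise
module <- cbind(sapply(1:8, function(i) driver + rnorm(nSamples, sd = 0.4)),
                -driver + rnorm(nSamples, sd = 0.4),
                rnorm(nSamples))
colnames(module) <- paste0("gene", 1:10)

# Approximate eigengene: first left singular vector of the scaled data,
# aligned with the average scaled expression
scaled <- scale(module)
ME <- svd(scaled)$u[, 1]
if (cor(ME, rowMeans(scaled)) < 0) ME <- -ME

# Module membership kME: correlation of each gene with the eigengene
kME <- as.numeric(cor(module, ME))
names(kME) <- colnames(module)

# Hypothetical helper (NOT a WGCNA function) applying the two rules above
trimModule <- function(kME, minKMEtoStay = 0.2, minCoreKME = 0.5,
                       minCoreKMESize = 3) {
  # Disband the module if its core is too small
  if (sum(kME > minCoreKME) < minCoreKMESize) return(character(0))
  # Otherwise keep only genes with sufficient module membership
  names(kME)[kME >= minKMEtoStay]
}

kept <- trimModule(kME)
```

An empty result from trimModule corresponds to a disbanded module whose genes all become unassigned.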

After all blocks have been processed, the function checks whether there are genes whose KME in the module they are assigned to is lower than their KME to another module. If p-values of the higher correlations are smaller than those of the native module by the factor reassignThresholdPS (in every set), the gene is re-assigned to the closer module.

In the last step, modules whose eigengenes are highly correlated are merged. This is achieved by clustering module eigengenes using the dissimilarity given by one minus their correlation, cutting the dendrogram at the height mergeCutHeight and merging all modules on each branch. The process is iterated until no modules are merged. See mergeCloseModules for more details on module merging.
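
A single pass of the merging step can be sketched in base R as follows. This is an illustration under stated assumptions: it uses simulated eigengenes from one data set rather than a consensus across sets, it does not call mergeCloseModules, and it does not iterate. All variable names are chosen for this example.

```r
set.seed(3)
nSamples <- 50
base <- rnorm(nSamples)

# Toy eigengenes: ME2 nearly duplicates ME1, ME3 is unrelated
MEs <- data.frame(ME1 = base,
                  ME2 = base + rnorm(nSamples, sd = 0.1),
                  ME3 = rnorm(nSamples))

mergeCutHeight <- 0.15

# Dissimilarity given by one minus the eigengene correlation
MEDiss <- 1 - cor(MEs)
METree <- hclust(as.dist(MEDiss), method = "average")

# Modules falling on the same branch below the cut height are merged
mergedGroup <- cutree(METree, h = mergeCutHeight)

# Map toy gene labels to merged labels; label 0 (unassigned) stays 0
oldLabels <- c(1, 1, 2, 2, 3, 3, 0)
newLabels <- ifelse(oldLabels == 0, 0,
                    unname(mergedGroup[paste0("ME", oldLabels)]))
```

In the function itself this pass is repeated with recalculated eigengenes until no further modules are merged.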

The module trimming and merging process is optionally iterated. Iterations are recommended but are (for now) not the default for backward compatibility.

References

Non-hierarchical consensus networks are described in Langfelder P, Horvath S (2007), Eigengene networks for studying the relationships between co-expression modules. BMC Systems Biology 2007, 1:54.

More in-depth discussion of selected topics can be found at http://www.peterlangfelder.com/ , and an FAQ at https://labs.genetics.ucla.edu/horvath/CoexpressionNetwork/Rpackages/WGCNA/faq.html .