hierarchicalConsensusModules: Hierarchical consensus network construction and module identification

Description

Hierarchical consensus network construction and module identification across multiple data sets.

Usage

hierarchicalConsensusModules(
   multiExpr, 
   multiWeights = NULL,
   multiExpr.imputed = NULL,
   # Data checking options
   checkMissingData = TRUE,
   # Blocking options
   blocks = NULL, 
   maxBlockSize = 5000, 
   blockSizePenaltyPower = 5,
   nPreclusteringCenters = NULL,
   randomSeed = 12345,
   # Network construction options. 
   networkOptions,
   # Save individual TOMs?
   saveIndividualTOMs = TRUE,
   individualTOMFileNames = "individualTOM-Set%s-Block%b.RData",
   keepIndividualTOMs = FALSE,
   # Consensus calculation options
   consensusTree = NULL,  
   # Return options
   saveConsensusTOM = TRUE,
   consensusTOMFilePattern = "consensusTOM-%a-Block%b.RData",
   # Keep the consensus? 
   keepConsensusTOM = saveConsensusTOM,
   # Internal handling of TOMs
   useDiskCache = NULL, chunkSize = NULL,
   cacheBase = ".blockConsModsCache",
   cacheDir = ".",
   # Alternative consensus TOM input from a previous calculation 
   consensusTOMInfo = NULL,
   # Basic tree cut options 
   deepSplit = 2, 
   detectCutHeight = 0.995, minModuleSize = 20,
   checkMinModuleSize = TRUE,
   # Advanced tree cut opyions
   maxCoreScatter = NULL, minGap = NULL,
   maxAbsCoreScatter = NULL, minAbsGap = NULL,
   minSplitHeight = NULL, minAbsSplitHeight = NULL,
   useBranchEigennodeDissim = FALSE,
   minBranchEigennodeDissim = mergeCutHeight,
   stabilityLabels = NULL,
   stabilityCriterion = c("Individual fraction", "Common fraction"),
   minStabilityDissim = NULL,
   pamStage = TRUE,  pamRespectsDendro = TRUE,
   iteratePruningAndMerging = FALSE,
   minCoreKME = 0.5, minCoreKMESize = minModuleSize/3,
   minKMEtoStay = 0.2,
   # Module eigengene calculation options
   impute = TRUE,
   trapErrors = FALSE,
   # Module merging options
   calibrateMergingSimilarities = FALSE,
   mergeCutHeight = 0.15, 
                    
   # General options
   collectGarbage = TRUE,
   verbose = 2, indent = 0,
   ...)

Arguments

multiExpr

Expression data in the multi-set format (see checkSets). A vector of lists, one per set. Each set must contain a component data that contains the expression data, with rows corresponding to samples and columns to genes or probes.

multiWeights

optional observation weights in the same format (and dimensions) as multiExpr. These weights are used for correlation calculations with data in multiExpr.

multiExpr.imputed

If multiExpr contain missing data, this argument can be used to supply the expression data with missing data imputed. If not given, the impute.knn function will be used to impute the missing data.

checkMissingData

Logical: should data be checked for excessive numbers of missing entries in genes and samples, and for genes with zero variance? See details.

blocks

Optional specification of blocks in which hierarchical clustering and module detection should be performed. If given, must be a numeric vector with one entry per gene of multiExpr giving the number of the block to which the corresponding gene belongs.

maxBlockSize

Integer giving maximum block size for module detection. Ignored if blocks above is non-NULL. Otherwise, if the number of genes in datExpr exceeds maxBlockSize, genes will be pre-clustered into blocks whose size should not exceed maxBlockSize.

blockSizePenaltyPower

Number specifying how strongly blocks should be penalized for exceeding the maximum size. Set to a lrge number or Inf if not exceeding maximum block size is very important.

nPreclusteringCenters

Number of centers to be used in the preclustering. Defaults to smaller of nGenes/20 and 100*nGenes/maxBlockSize, where nGenes is the nunber of genes (variables) in multiExpr.

randomSeed

Integer to be used as seed for the random number generator before the function starts. If a current seed exists, it is saved and restored upon exit. If NULL is given, the function will not save and restore the seed.

networkOptions

A single list of class NetworkOptions giving options for network calculation for all of the networks, or a multiData structure containing one such list for each input data set.

saveIndividualTOMs

Logical: should individual TOMs be saved to disk (TRUE) or retuned directly in the return value (FALSE)?

individualTOMFileNames

Character string giving the file names to save individual TOMs into. The following tags should be used to make the file names unique for each set and block: %s will be replaced by the set number; %N will be replaced by the set name (taken from names(multiExpr)) if it exists, otherwise by set number; %b will be replaced by the block number. If the file names turn out to be non-unique, an error will be generated.

keepIndividualTOMs

Logical: should individual TOMs be retained after the calculation is finished?

consensusTree

A list specifying the consensus calculation. See details.

saveConsensusTOM

Logical: should the consensus TOM be saved to disk?

consensusTOMFilePattern

Character string giving the file names to save consensus TOMs into. The following tags should be used to make the file names unique for each set and block: %s will be replaced by the set number; %N will be replaced by the set name (taken from names(multiExpr)) if it exists, otherwise by set number; %b will be replaced by the block number. If the file names turn out to be non-unique, an error will be generated.

keepConsensusTOM

Logical: should consensus TOM be retained after the calculation ends? Depending on saveConsensusTOM, the retained TOM is either saved to disk or returned within the return value.

useDiskCache

Logical: should disk cache be used for consensus calculations? The disk cache can be used to store chunks of calibrated data that are small enough to fit one chunk from each set into memory (blocks may be small enough to fit one block of one set into memory, but not small enough to fit one block from all sets in a consensus calculation into memory at the same time). Using disk cache is slower but lessens the memory footprint of the calculation. As a general guide, if individual data are split into blocks, we recommend setting this argument to TRUE. If this argument is NULL, the function will decide whether to use disk cache based on the number of sets and block sizes.

chunkSize

Integer giving the chunk size. If left NULL, a suitable size will be chosen automatically.

cacheDir

Directory in which to save cache files. The files are deleted on normal exit but persist if the function terminates abnormally.

cacheBase

Base for the file names of cache files.

consensusTOMInfo

If the consensus TOM has been pre-calculated using function hierarchicalConsensusTOM, this argument can be used to supply it. If given, the consensus TOM calculation options above are ignored.

deepSplit

Numeric value between 0 and 4. Provides a simplified control over how sensitive module detection should be to module splitting, with 0 least and 4 most sensitive. See cutreeDynamic for more details.

detectCutHeight

Dendrogram cut height for module detection. See cutreeDynamic for more details.

minModuleSize

Minimum module size for module detection. See cutreeDynamic for more details.

checkMinModuleSize

logical: should sanity checks be performed on minModuleSize?

maxCoreScatter

maximum scatter of the core for a branch to be a cluster, given as the fraction of cutHeight relative to the 5th percentile of joining heights. See cutreeDynamic for more details.

minGap

minimum cluster gap given as the fraction of the difference between cutHeight and the 5th percentile of joining heights. See cutreeDynamic for more details.

maxAbsCoreScatter

maximum scatter of the core for a branch to be a cluster given as absolute heights. If given, overrides maxCoreScatter. See cutreeDynamic for more details.

minAbsGap

minimum cluster gap given as absolute height difference. If given, overrides minGap. See cutreeDynamic for more details.

minSplitHeight

Minimum split height given as the fraction of the difference between cutHeight and the 5th percentile of joining heights. Branches merging below this height will automatically be merged. Defaults to zero but is used only if minAbsSplitHeight below is NULL.

minAbsSplitHeight

Minimum split height given as an absolute height. Branches merging below this height will automatically be merged. If not given (default), will be determined from minSplitHeight above.

useBranchEigennodeDissim

Logical: should branch eigennode (eigengene) dissimilarity be considered when merging branches in Dynamic Tree Cut?

minBranchEigennodeDissim

Minimum consensus branch eigennode (eigengene) dissimilarity for branches to be considerd separate. The branch eigennode dissimilarity in individual sets is simly 1-correlation of the eigennodes; the consensus is defined as quantile with probability consensusQuantile.

stabilityLabels

Optional matrix of cluster labels that are to be used for calculating branch dissimilarity based on split stability. The number of rows must equal the number of genes in multiExpr; the number of columns (clusterings) is arbitrary. See branchSplitFromStabilityLabels for details.

stabilityCriterion

One of c("Individual fraction", "Common fraction"), indicating which method for assessing stability similarity of two branches should be used. We recommend "Individual fraction" which appears to perform better; the "Common fraction" method is provided for backward compatibility since it was the (only) method available prior to WGCNA version 1.60.

minStabilityDissim

Minimum stability dissimilarity criterion for two branches to be considered separate. Should be a number between 0 (essentially no dissimilarity required) and 1 (perfect dissimilarity or distinguishability based on stabilityLabels). See branchSplitFromStabilityLabels for details.

pamStage

logical. If TRUE, the second (PAM-like) stage of module detection will be performed. See cutreeDynamic for more details.

pamRespectsDendro

Logical, only used when pamStage is TRUE. If TRUE, the PAM stage will respect the dendrogram in the sense an object can be PAM-assigned only to clusters that lie below it on the branch that the object is merged into. See cutreeDynamic for more details.

iteratePruningAndMerging

Logical: should pruning of low-KME genes and module merging be iterated? For backward compatibility, the default is FALSE but it setting it to TRUE may lead to better-defined modules.

minCoreKME

a number between 0 and 1. If a detected module does not have at least minModuleKMESize genes with eigengene connectivity at least minCoreKME, the module is disbanded (its genes are unlabeled and returned to the pool of genes waiting for mofule detection).

minCoreKMESize

see minCoreKME above.

minKMEtoStay

genes whose eigengene connectivity to their module eigengene is lower than minKMEtoStay are removed from the module.

impute

logical: should imputation be used for module eigengene calculation? See moduleEigengenes for more details.

trapErrors

logical: should errors in calculations be trapped?

calibrateMergingSimilarities

Logical: should module eigengene similarities be calibrataed before calculating the consensus? Although calibration is in principle desirable, the calibration methods currently available assume large data and do not work very well on eigengene similarities.

mergeCutHeight

Dendrogram cut height for module merging.

collectGarbage

Logical: should garbage be collected after some of the memory-intensive steps?

verbose

integer level of verbosity. Zero means silent, higher values make the output progressively more and more verbose.

indent

indentation for diagnostic messages. Zero means no indentation, each unit adds two spaces.

…

Other arguments. Currently ignored.

Value

List with the following components:

labels

A numeric vector with one component per variable (gene), giving the module label of each variable (gene). Label 0 is reserved for unassigned variables; module labels are sequential and smaller numbers are used for larger modules.

unmergedLabels

A numeric vector with one component per variable (gene), giving the unmerged module label of each variable (gene), i.e., module labels before the call to module merging.

colors

A character vector with one component per variable (gene), giving the module colors. The labels are mapped to colors using labels2colors.

unmergedColors

A character vector with one component per variable (gene), giving the unmerged module colors.

multiMEs

Module eigengenes corresponding to the modules returned in colors, in multi-set format. A vector of lists, one per set, containing eigengenes, proportion of variance explained and other information. See multiSetMEs for a detailed description.

dendrograms

A list with one component for each block of genes. Each component is the hierarchical clustering dendrogram obtained by clustering the consensus gene dissimilarity in the corresponding block.

consensusTOMInfo

A list detailing various aspects of the consensus TOM. See hierarchicalConsensusTOM for details.

blockInfo

A list with information about blocks as well as the vriables and observations (genes and samples) retained after filtering out those with zero variance and too many missing values.

moduleIdentificationArguments

A list with the module identification arguments supplied to this function. Contains deepSplit, detectCutHeight, minModuleSize, maxCoreScatter, minGap, maxAbsCoreScatter, minAbsGap, minSplitHeight, useBranchEigennodeDissim, minBranchEigennodeDissim, minStabilityDissim, pamStage, pamRespectsDendro, minCoreKME, minCoreKMESize, minKMEtoStay, calibrateMergingSimilarities, and mergeCutHeight.

Details

This function calculates a consensus network with a flexible, possibly hierarchical consensus specification, identifies (consensus) modules in the network, and calculates their eigengenes. "Blockwise" calculation is available for large data sets for which a full network (TOM or adjacency matrix) would not fit into avilable RAM.

The input can be either several numerical data sets (expression etc) in the argument multiExpr together with all necessary network construction options, or a pre-calculated network, typically the result of a call to hierarchicalConsensusTOM.

Steps in the network construction include the following: (1) optional filtering of variables (genes) and observations (samples) that contain too many missing values or have zero variance; (2) optional pre-clustering to split data into blocks of manageable size; (3) calculation of adjacencies and optionally of TOMs in each individual data set; (4) calculation of consensus network from the individual networks; (5) hierarchical clustering and module identification; (6) trimming of modules by removing genes with low correlation with the eigengene of the module; and (7) merging of modules whose eigengenes are strongly correlated.

Steps 1-4 (up to and including the calculation of consensus network from the individual networks) are handled by the function hierarchicalConsensusTOM.

Variables (genes) are clustered using average-linkage hierarchical clustering and modules are identified in the resulting dendrogram by the Dynamic Hybrid tree cut.

Found modules are trimmed of genes whose consensus module membership kME (that is, correlation with module eigengene) is less than minKMEtoStay. Modules in which fewer than minCoreKMESize genes have consensus KME higher than minCoreKME are disbanded, i.e., their constituent genes are pronounced unassigned.

After all blocks have been processed, the function checks whether there are genes whose KME in the module they assigned is lower than KME to another module. If p-values of the higher correlations are smaller than those of the native module by the factor reassignThresholdPS (in every set), the gene is re-assigned to the closer module.

In the last step, modules whose eigengenes are highly correlated are merged. This is achieved by clustering module eigengenes using the dissimilarity given by one minus their correlation, cutting the dendrogram at the height mergeCutHeight and merging all modules on each branch. The process is iterated until no modules are merged. See mergeCloseModules for more details on module merging.

The module trimming and merging process is optionally iterated. Iterations are recommended but are (for now) not the default for backward compatibility.

References

Non-hierarchical consensus networks are described in Langfelder P, Horvath S (2007), Eigengene networks for studying the relationships between co-expression modules. BMC Systems Biology 2007, 1:54.

More in-depth discussion of selected topics can be found at http://www.peterlangfelder.com/ , and an FAQ at https://labs.genetics.ucla.edu/horvath/CoexpressionNetwork/Rpackages/WGCNA/faq.html .