Hierarchical consensus network construction and module identification across multiple data sets.
hierarchicalConsensusModules(
multiExpr,
multiWeights = NULL,
multiExpr.imputed = NULL,
# Data checking options
checkMissingData = TRUE,
# Blocking options
blocks = NULL,
maxBlockSize = 5000,
blockSizePenaltyPower = 5,
nPreclusteringCenters = NULL,
randomSeed = 12345,
# Network construction options.
networkOptions,
# Save individual TOMs?
saveIndividualTOMs = TRUE,
individualTOMFileNames = "individualTOM-Set%s-Block%b.RData",
keepIndividualTOMs = FALSE,
# Consensus calculation options
consensusTree = NULL,
# Return options
saveConsensusTOM = TRUE,
consensusTOMFilePattern = "consensusTOM-%a-Block%b.RData",
# Keep the consensus?
keepConsensusTOM = saveConsensusTOM,
# Internal handling of TOMs
useDiskCache = NULL, chunkSize = NULL,
cacheBase = ".blockConsModsCache",
cacheDir = ".",
# Alternative consensus TOM input from a previous calculation
consensusTOMInfo = NULL,
# Basic tree cut options
deepSplit = 2,
detectCutHeight = 0.995, minModuleSize = 20,
checkMinModuleSize = TRUE,
# Advanced tree cut options
maxCoreScatter = NULL, minGap = NULL,
maxAbsCoreScatter = NULL, minAbsGap = NULL,
minSplitHeight = NULL, minAbsSplitHeight = NULL,
useBranchEigennodeDissim = FALSE,
minBranchEigennodeDissim = mergeCutHeight,
stabilityLabels = NULL,
stabilityCriterion = c("Individual fraction", "Common fraction"),
minStabilityDissim = NULL,
pamStage = TRUE, pamRespectsDendro = TRUE,
iteratePruningAndMerging = FALSE,
minCoreKME = 0.5, minCoreKMESize = minModuleSize/3,
minKMEtoStay = 0.2,
# Module eigengene calculation options
impute = TRUE,
trapErrors = FALSE,
excludeGrey = FALSE,
# Module merging options
calibrateMergingSimilarities = FALSE,
mergeCutHeight = 0.15,
# General options
collectGarbage = TRUE,
verbose = 2, indent = 0,
...)
List with the following components:
A numeric vector with one component per variable (gene), giving the module label of each variable (gene). Label 0 is reserved for unassigned variables; module labels are sequential and smaller numbers are used for larger modules.
A numeric vector with one component per variable (gene), giving the unmerged module label of each variable (gene), i.e., module labels before the call to module merging.
A character vector with one component per variable (gene), giving the module colors. The labels are mapped to colors using labels2colors.
A character vector with one component per variable (gene), giving the unmerged module colors.
Module eigengenes corresponding to the modules returned in colors, in multi-set format. A vector of lists, one per set, containing eigengenes, proportion of variance explained and other information. See multiSetMEs for a detailed description.
A list with one component for each block of genes. Each component is the hierarchical clustering dendrogram obtained by clustering the consensus gene dissimilarity in the corresponding block.
A list detailing various aspects of the consensus TOM. See hierarchicalConsensusTOM for details.
A list with information about blocks as well as the variables and observations (genes and samples) retained after filtering out those with zero variance and too many missing values.
A list with the module identification arguments supplied to this function. Contains deepSplit, detectCutHeight, minModuleSize, maxCoreScatter, minGap, maxAbsCoreScatter, minAbsGap, minSplitHeight, useBranchEigennodeDissim, minBranchEigennodeDissim, minStabilityDissim, pamStage, pamRespectsDendro, minCoreKME, minCoreKMESize, minKMEtoStay, calibrateMergingSimilarities, and mergeCutHeight.
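The labelling convention described for the module labels (label 0 for unassigned variables, smaller numbers for larger modules) can be illustrated with a minimal base R sketch. The vector `rawModules` and its values are made up for illustration; this is not the package's internal code.

```r
# Hypothetical raw module assignments for 10 genes; NA marks unassigned genes
rawModules <- c("a", "b", "b", "c", "b", "a", NA, "c", "c", "c")

# Module sizes, largest first (table() drops NA by default)
moduleSizes <- sort(table(rawModules), decreasing = TRUE)

# Rank of each gene's module by size: the largest module becomes label 1
labels <- match(rawModules, names(moduleSizes))

# Label 0 is reserved for unassigned genes
labels[is.na(labels)] <- 0

print(labels)
```

Here module "c" (4 genes) receives label 1, "b" (3 genes) label 2, and "a" (2 genes) label 3.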
Expression data in the multi-set format (see checkSets). A vector of lists, one per set. Each set must contain a component data that contains the expression data, with rows corresponding to samples and columns to genes or probes.
Optional observation weights in the same format (and dimensions) as multiExpr. These weights are used for correlation calculations with data in multiExpr.
If multiExpr contains missing data, this argument can be used to supply the expression data with missing data imputed. If not given, the impute.knn function will be used to impute the missing data.
Logical: should data be checked for excessive numbers of missing entries in genes and samples, and for genes with zero variance? See details.
Optional specification of blocks in which hierarchical clustering and module detection should be performed. If given, must be a numeric vector with one entry per gene of multiExpr giving the number of the block to which the corresponding gene belongs.
Integer giving maximum block size for module detection. Ignored if blocks above is non-NULL. Otherwise, if the number of genes in multiExpr exceeds maxBlockSize, genes will be pre-clustered into blocks whose size should not exceed maxBlockSize.
Number specifying how strongly blocks should be penalized for exceeding the maximum size. Set to a large number or Inf if not exceeding maximum block size is very important.
Number of centers to be used in the preclustering. Defaults to the smaller of nGenes/20 and 100*nGenes/maxBlockSize, where nGenes is the number of genes (variables) in multiExpr.
Integer to be used as seed for the random number generator before the function starts. If a current seed exists, it is saved and restored upon exit. If NULL is given, the function will not save and restore the seed.
A single list of class NetworkOptions giving options for network calculation for all of the networks, or a multiData structure containing one such list for each input data set.
Logical: should individual TOMs be saved to disk (TRUE) or returned directly in the return value (FALSE)?
Character string giving the file names to save individual TOMs into. The following tags should be used to make the file names unique for each set and block: %s will be replaced by the set number; %N will be replaced by the set name (taken from names(multiExpr)) if it exists, otherwise by set number; %b will be replaced by the block number. If the file names turn out to be non-unique, an error will be generated.
Logical: should individual TOMs be retained after the calculation is finished?
A list specifying the consensus calculation. See details.
Logical: should the consensus TOM be saved to disk?
Character string giving the file names to save consensus TOMs into. The following tags should be used to make the file names unique for each set and block: %s will be replaced by the set number; %N will be replaced by the set name (taken from names(multiExpr)) if it exists, otherwise by set number; %b will be replaced by the block number. If the file names turn out to be non-unique, an error will be generated.
Logical: should consensus TOM be retained after the calculation ends? Depending on saveConsensusTOM, the retained TOM is either saved to disk or returned within the return value.
Logical: should disk cache be used for consensus calculations? The disk cache can be used to store chunks of
calibrated data that are small enough to fit one chunk from each set into memory (blocks may be small enough
to fit one block of one set into memory, but not small enough to fit one block from all sets in a consensus
calculation into memory at the same time). Using disk cache is slower but lessens the memory footprint of
the calculation.
As a general guide, if individual data are split into blocks, we recommend setting this argument to TRUE. If this argument is NULL, the function will decide whether to use disk cache based on the number of sets and block sizes.
Integer giving the chunk size. If left NULL, a suitable size will be chosen automatically.
Directory in which to save cache files. The files are deleted on normal exit but persist if the function terminates abnormally.
Base for the file names of cache files.
If the consensus TOM has been pre-calculated using function hierarchicalConsensusTOM, this argument can be used to supply it. If given, the consensus TOM calculation options above are ignored.
Numeric value between 0 and 4. Provides a simplified control over how sensitive module detection should be to module splitting, with 0 least and 4 most sensitive. See cutreeDynamic for more details.
Dendrogram cut height for module detection. See cutreeDynamic for more details.
Minimum module size for module detection. See cutreeDynamic for more details.
Logical: should sanity checks be performed on minModuleSize?
Maximum scatter of the core for a branch to be a cluster, given as the fraction of cutHeight relative to the 5th percentile of joining heights. See cutreeDynamic for more details.
Minimum cluster gap given as the fraction of the difference between cutHeight and the 5th percentile of joining heights. See cutreeDynamic for more details.
Maximum scatter of the core for a branch to be a cluster given as absolute heights. If given, overrides maxCoreScatter. See cutreeDynamic for more details.
Minimum cluster gap given as absolute height difference. If given, overrides minGap. See cutreeDynamic for more details.
Minimum split height given as the fraction of the difference between cutHeight and the 5th percentile of joining heights. Branches merging below this height will automatically be merged. Defaults to zero but is used only if minAbsSplitHeight below is NULL.
Minimum split height given as an absolute height. Branches merging below this height will automatically be merged. If not given (default), will be determined from minSplitHeight above.
Logical: should branch eigennode (eigengene) dissimilarity be considered when merging branches in Dynamic Tree Cut?
Minimum consensus branch eigennode (eigengene) dissimilarity for branches to be considered separate. The branch eigennode dissimilarity in individual sets is simply 1-correlation of the eigennodes; the consensus is defined as quantile with probability consensusQuantile.
Optional matrix of cluster labels that are to be used for calculating branch dissimilarity based on split stability. The number of rows must equal the number of genes in multiExpr; the number of columns (clusterings) is arbitrary. See branchSplitFromStabilityLabels for details.
One of c("Individual fraction", "Common fraction"), indicating which method for assessing stability similarity of two branches should be used. We recommend "Individual fraction" which appears to perform better; the "Common fraction" method is provided for backward compatibility since it was the (only) method available prior to WGCNA version 1.60.
Minimum stability dissimilarity criterion for two branches to be considered separate. Should be a number between 0 (essentially no dissimilarity required) and 1 (perfect dissimilarity or distinguishability based on stabilityLabels). See branchSplitFromStabilityLabels for details.
Logical. If TRUE, the second (PAM-like) stage of module detection will be performed. See cutreeDynamic for more details.
Logical, only used when pamStage is TRUE. If TRUE, the PAM stage will respect the dendrogram in the sense that an object can be PAM-assigned only to clusters that lie below it on the branch that the object is merged into. See cutreeDynamic for more details.
Logical: should pruning of low-KME genes and module merging be iterated? For backward compatibility, the default is FALSE, but setting it to TRUE may lead to better-defined modules.
A number between 0 and 1. If a detected module does not have at least minCoreKMESize genes with eigengene connectivity at least minCoreKME, the module is disbanded (its genes are unlabeled and returned to the pool of genes waiting for module detection).
See minCoreKME above.
Genes whose eigengene connectivity to their module eigengene is lower than minKMEtoStay are removed from the module.
Logical: should imputation be used for module eigengene calculation? See moduleEigengenes for more details.
logical: should errors in calculations be trapped?
logical: should the returned module eigengenes exclude the eigengene of the "module" that contains unassigned genes?
Logical: should module eigengene similarities be calibrated before calculating the consensus? Although calibration is in principle desirable, the calibration methods currently available assume large data and do not work very well on eigengene similarities.
Dendrogram cut height for module merging.
Logical: should garbage be collected after some of the memory-intensive steps?
integer level of verbosity. Zero means silent, higher values make the output progressively more and more verbose.
indentation for diagnostic messages. Zero means no indentation, each unit adds two spaces.
Other arguments. Currently ignored.
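The tag substitution described for individualTOMFileNames can be sketched in base R as follows. The helper `expandTOMFileName` is hypothetical and written only for illustration; it is not a WGCNA function and the package may implement the substitution differently.

```r
# Hypothetical helper (not part of WGCNA): expand %s, %N and %b in a pattern
expandTOMFileName <- function(pattern, setNumber, blockNumber, setName = NULL) {
  # %N falls back to the set number when no set name is available
  if (is.null(setName)) setName <- setNumber
  out <- gsub("%s", as.character(setNumber), pattern, fixed = TRUE)
  out <- gsub("%N", as.character(setName), out, fixed = TRUE)
  gsub("%b", as.character(blockNumber), out, fixed = TRUE)
}

pattern <- "individualTOM-Set%s-Block%b.RData"

# All combinations of 2 sets and 3 blocks
grid <- expand.grid(set = 1:2, block = 1:3)
fileNames <- mapply(function(s, b) expandTOMFileName(pattern, s, b),
                    grid$set, grid$block)

# Mirror the documented behaviour: non-unique names are an error
if (anyDuplicated(fileNames) > 0) stop("file names are not unique")

print(fileNames)
```

A pattern that omits %b (or both %s and %N) would map several set and block combinations to the same file name, which is the situation the documented error guards against.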
Peter Langfelder
This function calculates a consensus network with a flexible, possibly hierarchical consensus specification, identifies (consensus) modules in the network, and calculates their eigengenes. "Blockwise" calculation is available for large data sets for which a full network (TOM or adjacency matrix) would not fit into available RAM.
The input can be either several numerical data sets (expression etc) in the argument multiExpr together with all necessary network construction options, or a pre-calculated network, typically the result of a call to hierarchicalConsensusTOM.
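For orientation, the multi-set format expected in multiExpr (a list with one entry per set, each holding a `data` component with samples in rows and genes in columns) can be built by hand in base R. The set names, dimensions, and random values below are invented for illustration only.

```r
set.seed(1)
nGenes <- 50
geneNames <- paste0("gene", seq_len(nGenes))

# Two hypothetical data sets with different numbers of samples (rows)
# but the same genes (columns)
multiExpr <- list(
  setA = list(data = matrix(rnorm(20 * nGenes), nrow = 20,
                            dimnames = list(NULL, geneNames))),
  setB = list(data = matrix(rnorm(30 * nGenes), nrow = 30,
                            dimnames = list(NULL, geneNames)))
)

# Each set carries a 'data' component; columns (genes) agree across sets
sapply(multiExpr, function(set) dim(set$data))
```

The set names given here are what the %N tag in the file name patterns would refer to.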
Steps in the network construction include the following: (1) optional filtering of variables (genes) and observations (samples) that contain too many missing values or have zero variance; (2) optional pre-clustering to split data into blocks of manageable size; (3) calculation of adjacencies and optionally of TOMs in each individual data set; (4) calculation of consensus network from the individual networks; (5) hierarchical clustering and module identification; (6) trimming of modules by removing genes with low correlation with the eigengene of the module; and (7) merging of modules whose eigengenes are strongly correlated.
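Step (1) can be pictured with a short base R sketch that keeps only genes passing the checks in every set. The helper `goodGenesInSet` and the 50% missing-value threshold are assumptions made for this example; the actual criteria used by the package are not stated here.

```r
# Two tiny hypothetical data sets: 4 samples (rows) by 4 genes (columns)
set1 <- matrix(c(1, 2, 3, 4,     # g1: fine
                 5, 5, 5, 5,     # g2: zero variance
                 1, NA, NA, NA,  # g3: mostly missing
                 2, 4, 1, 3),    # g4: fine
               nrow = 4, dimnames = list(NULL, paste0("g", 1:4)))
set2 <- matrix(c(2, 1, 4, 3,
                 1, 2, 3, 4,
                 1, 2, 3, 4,
                 4, 3, 2, 1),
               nrow = 4, dimnames = list(NULL, paste0("g", 1:4)))

# Hypothetical check: not too many missing values and non-zero variance
goodGenesInSet <- function(x, maxMissingFraction = 0.5) {
  missingOK <- colMeans(is.na(x)) <= maxMissingFraction
  v <- apply(x, 2, var, na.rm = TRUE)
  missingOK & !is.na(v) & v > 0
}

# A gene is retained only if it passes in all sets
keepGenes <- goodGenesInSet(set1) & goodGenesInSet(set2)
print(keepGenes)
```

Only g1 and g4 are retained: g2 is constant in the first set and g3 has three of four values missing there.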
Steps 1-4 (up to and including the calculation of consensus network from the individual networks) are handled by the function hierarchicalConsensusTOM.
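As a small worked example for the blocking step (2), the documented default for nPreclusteringCenters is the smaller of nGenes/20 and 100*nGenes/maxBlockSize. The gene count below is an arbitrary illustration, and whether the function rounds the result is not covered here.

```r
nGenes <- 20000        # hypothetical number of genes
maxBlockSize <- 5000   # default from the usage section

# Pre-clustering is only relevant when the data do not fit into one block
needsBlocks <- nGenes > maxBlockSize

# Documented default: the smaller of the two candidate values
nPreclusteringCenters <- min(nGenes / 20, 100 * nGenes / maxBlockSize)

print(needsBlocks)
print(nPreclusteringCenters)
```

With these numbers the two candidates are 1000 and 400, so the default would be 400 centers.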
Variables (genes) are clustered using average-linkage hierarchical clustering and modules are identified in the resulting dendrogram by the Dynamic Hybrid tree cut.
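The clustering step can be sketched with base R alone. The example below is a deliberate simplification: it uses an unsigned correlation adjacency with power 6 instead of a TOM, takes the simplest possible consensus (the element-wise minimum across two sets) rather than a hierarchical consensus tree, and stops at the dendrogram instead of applying the Dynamic Hybrid tree cut. All data and settings are assumptions made for illustration.

```r
set.seed(2)

# Hypothetical data: genes 1-6 share a common driver, genes 7-12 are noise
makeSet <- function(nSamples) {
  driver <- rnorm(nSamples)
  cbind(sapply(1:6, function(i) driver + rnorm(nSamples, sd = 0.5)),
        matrix(rnorm(nSamples * 6), nSamples, 6))
}
sets <- list(makeSet(25), makeSet(40))

# Unsigned adjacency with soft-thresholding power 6 (an assumed setting)
adjacencies <- lapply(sets, function(x) abs(cor(x))^6)

# Simplest consensus: element-wise minimum of the individual networks
consensus <- do.call(pmin, adjacencies)

# Average-linkage hierarchical clustering of the consensus dissimilarity
tree <- hclust(as.dist(1 - consensus), method = "average")
print(tree)
```

Taking the minimum means a pair of genes is strongly connected in the consensus only if it is strongly connected in every input set.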
Found modules are trimmed of genes whose consensus module membership kME (that is, correlation with module eigengene) is less than minKMEtoStay. Modules in which fewer than minCoreKMESize genes have consensus KME higher than minCoreKME are disbanded, i.e., their constituent genes are pronounced unassigned.
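The trimming and disbanding rules can be illustrated for a single module in a single data set. This is a base R sketch under stated assumptions: the eigengene is taken as the first left singular vector of the scaled data, kME is the plain Pearson correlation with it, and the consensus across sets is not modelled. The data and the small minModuleSize are invented for the example.

```r
set.seed(1)
nSamples <- 30
driver <- rnorm(nSamples)

# Hypothetical module: 8 genes follow the driver, 2 genes are pure noise
expr <- cbind(sapply(1:8, function(i) driver + rnorm(nSamples, sd = 0.3)),
              matrix(rnorm(nSamples * 2), nSamples, 2))
colnames(expr) <- paste0("gene", 1:10)

# Module eigengene: first left singular vector of the scaled expression data
scaled <- scale(expr)
eigengene <- svd(scaled)$u[, 1]
# Orient the eigengene along the average expression of the module
if (cor(eigengene, rowMeans(scaled)) < 0) eigengene <- -eigengene

# Module membership kME: correlation of each gene with the eigengene
kME <- setNames(as.vector(cor(expr, eigengene)), colnames(expr))

minKMEtoStay <- 0.2
minCoreKME <- 0.5
minModuleSize <- 6                   # small value chosen for this toy module
minCoreKMESize <- minModuleSize / 3  # same rule as the documented default

# Genes below minKMEtoStay are trimmed from the module
stays <- kME >= minKMEtoStay

# The module is disbanded if too few genes exceed minCoreKME
disband <- sum(kME > minCoreKME) < minCoreKMESize

print(round(kME, 2))
print(disband)
```

In this toy module the eight driver-following genes form a large enough core, so the module is kept; whether the two noise genes are trimmed depends on their chance correlation with the eigengene.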
After all blocks have been processed, the function checks whether there are genes whose KME in the module to which they are assigned is lower than their KME to another module. If the p-values of the higher correlations are smaller than those of the native module by the factor reassignThresholdPS (in every set), the gene is re-assigned to the closer module.
In the last step, modules whose eigengenes are highly correlated are merged. This is achieved by clustering module eigengenes using the dissimilarity given by one minus their correlation, cutting the dendrogram at the height mergeCutHeight and merging all modules on each branch. The process is iterated until no modules are merged. See mergeCloseModules for more details on module merging.
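One pass of the merging step can be sketched in base R. The three eigengenes below are constructed by hand so that ME1 and ME2 are strongly correlated and ME3 is uncorrelated with both; the use of average linkage and the per-gene labels are assumptions for this example. In the real procedure eigengenes of merged modules would be recalculated and the pass repeated until nothing merges, and with several sets a consensus of the eigengene similarities would be used.

```r
# Hand-built, mutually orthogonal patterns over 20 samples
a  <- rep(c(1, -1), 10)
b  <- rep(c(1, 1, -1, -1), 5)
c3 <- rep(c(1, -1, -1, 1), 5)

# ME2 is ME1 plus a small orthogonal perturbation; ME3 is unrelated
MEs <- cbind(ME1 = a, ME2 = a + 0.3 * b, ME3 = c3)

mergeCutHeight <- 0.15

# Dissimilarity: one minus the eigengene correlation
MEDiss <- 1 - cor(MEs)
METree <- hclust(as.dist(MEDiss), method = "average")

# Modules on the same branch below mergeCutHeight are merged
branch <- cutree(METree, h = mergeCutHeight)

# Hypothetical per-gene module labels before merging
moduleLabels <- c(1, 1, 2, 3, 2, 3, 1)
mergedLabels <- unname(branch[paste0("ME", moduleLabels)])

print(branch)
print(mergedLabels)
```

ME1 and ME2 have a correlation of about 0.96 (dissimilarity about 0.04), so modules 1 and 2 end up on the same branch and are merged, while module 3 stays separate.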
The module trimming and merging process is optionally iterated. Iterations are recommended but are (for now) not the default for backward compatibility.
Non-hierarchical consensus networks are described in Langfelder P, Horvath S (2007), Eigengene networks for studying the relationships between co-expression modules. BMC Systems Biology 2007, 1:54.
More in-depth discussion of selected topics can be found at http://www.peterlangfelder.com/ , and an FAQ at https://labs.genetics.ucla.edu/horvath/CoexpressionNetwork/Rpackages/WGCNA/faq.html .
hierarchicalConsensusTOM for calculation of hierarchical consensus networks (adjacency and TOM), and a more detailed description of the calculation;
hclust and cutreeHybrid for hierarchical clustering and the Dynamic Tree Cut branch cutting method;
mergeCloseModules for module merging;
blockwiseModules for an analogous analysis on a single data set.