Calculates topological overlaps in the given (expression) data. If the number of variables (columns) in the input data is too large, the data is first split using pre-clustering, then topological overlaps are calculated in each block.
blockwiseIndividualTOMs(
multiExpr,
# Data checking options
checkMissingData = TRUE,
# Blocking options
blocks = NULL,
maxBlockSize = 5000,
blockSizePenaltyPower = 5,
nPreclusteringCenters = NULL,
randomSeed = 12345,
# Network construction arguments: correlation options
corType = "pearson",
maxPOutliers = 1,
quickCor = 0,
pearsonFallback = "individual",
cosineCorrelation = FALSE,
# Adjacency function options
power = 6,
networkType = "unsigned",
checkPower = TRUE,
replaceMissingAdjacencies = FALSE,
# Topological overlap options
TOMType = "unsigned",
TOMDenom = "min",
# Save individual TOMs? If not, they will be returned in the session.
saveTOMs = TRUE,
individualTOMFileNames = "individualTOM-Set%s-Block%b.RData",
# General options
nThreads = 0,
verbose = 2, indent = 0)
expression data in the multi-set format (see checkSets). A vector of
lists, one per set. Each set must contain a component data that contains the expression data, with
rows corresponding to samples and columns to genes or probes.
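The multi-set format described above can be sketched in base R as follows. This is only an illustration with simulated values; the set names setA and setB, the gene names, and the helper makeSet are hypothetical.

```r
# Simulated multi-set expression data: a list with one element per set,
# each element being a list with a component named 'data'.
set.seed(1)
nGenes <- 50
geneNames <- paste0("gene", seq_len(nGenes))

# Hypothetical helper: one set with samples in rows and genes in columns.
makeSet <- function(nSamples) {
  m <- matrix(rnorm(nSamples * nGenes), nrow = nSamples, ncol = nGenes)
  colnames(m) <- geneNames
  list(data = m)
}

# Sets may differ in the number of samples but share the same genes (columns).
multiExpr <- list(setA = makeSet(20), setB = makeSet(30))

sameGenes <- all(vapply(multiExpr,
                        function(s) identical(colnames(s$data), geneNames),
                        logical(1)))
```

Sets may have different numbers of samples (rows), but the columns refer to the same genes in every set.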
logical: should data be checked for excessive numbers of missing entries in genes and samples, and for genes with zero variance? See details.
optional specification of blocks in which hierarchical clustering and module detection
should be performed. If given, must be a numeric vector with one entry per gene of multiExpr,
giving the number of the block to which the corresponding gene belongs.
integer giving maximum block size for module detection. Ignored if blocks above is non-NULL.
Otherwise, if the number of genes in multiExpr exceeds maxBlockSize, genes
will be pre-clustered into blocks whose size should not exceed maxBlockSize.
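As a rough illustration of what a user-supplied blocks vector looks like, the sketch below assigns genes to blocks in consecutive runs. This naive assignment is an assumption made for the example only; it is not the pre-clustering the function performs when blocks is NULL.

```r
# Hypothetical sizes, for illustration only.
nGenes <- 12000
maxBlockSize <- 5000

# Number of blocks needed so that no block exceeds maxBlockSize.
nBlocks <- ceiling(nGenes / maxBlockSize)

# One block label per gene: genes 1..5000 -> block 1, 5001..10000 -> block 2, ...
blocks <- rep(seq_len(nBlocks), each = maxBlockSize, length.out = nGenes)

# Block sizes, to confirm that none exceeds the limit.
blockSizes <- table(blocks)
```

A vector built this way has one entry per gene and could be passed as the blocks argument, in which case maxBlockSize is ignored.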
number specifying how strongly blocks should be penalized for exceeding the
maximum size. Set to a large number or Inf if not exceeding the maximum block size is very important.
number of centers for pre-clustering. Larger numbers typically result in better
but slower pre-clustering. The default is as.integer(min(nGenes/20, 100*nGenes/preferredSize))
and is an attempt to arrive at a reasonable number given the resources available.
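The default formula above can be evaluated directly. The sketch below assumes that preferredSize corresponds to maxBlockSize, which the text does not state explicitly; the function name defaultCenters is hypothetical.

```r
# Default number of pre-clustering centers, as given by the formula above.
# Assumption: 'preferredSize' plays the role of maxBlockSize.
defaultCenters <- function(nGenes, preferredSize) {
  as.integer(min(nGenes / 20, 100 * nGenes / preferredSize))
}

centersLarge <- defaultCenters(nGenes = 12000, preferredSize = 5000)  # min(600, 240)
centersSmall <- defaultCenters(nGenes = 6000, preferredSize = 5000)   # min(300, 120)
```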
integer to be used as seed for the random number generator before the function
starts. If a current seed exists, it is saved and restored upon exit. If NULL is given, the
function will not save and restore the seed.
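The save-and-restore behavior described above follows a common base R pattern, sketched below. The wrapper withTemporarySeed is hypothetical and only illustrates the idea; it is not the package's internal code.

```r
# Evaluate 'expr' under a fixed seed, then restore the caller's RNG state.
withTemporarySeed <- function(seed, expr) {
  if (!is.null(seed)) {
    hadSeed <- exists(".Random.seed", envir = globalenv(), inherits = FALSE)
    if (hadSeed) savedSeed <- get(".Random.seed", envir = globalenv())
    on.exit({
      if (hadSeed) {
        # Put the previous RNG state back.
        assign(".Random.seed", savedSeed, envir = globalenv())
      } else {
        # No seed existed before the call; remove the one created here.
        rm(".Random.seed", envir = globalenv())
      }
    })
    set.seed(seed)
  }
  expr  # the lazily evaluated argument is forced here, after the seed is set
}

set.seed(99)
before <- get(".Random.seed", envir = globalenv())
x1 <- withTemporarySeed(12345, runif(3))
x2 <- withTemporarySeed(12345, runif(3))
after <- get(".Random.seed", envir = globalenv())
```

Both calls produce the same numbers, and the random number stream of the calling session is left where it was.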
character string specifying the correlation to be used. Allowed values are (unique
abbreviations of) "pearson" and "bicor", corresponding to Pearson and biweight
midcorrelation, respectively. Missing values are handled using the pairwise.complete.obs option.
only used for corType=="bicor". Specifies the maximum percentile of data
that can be considered outliers on either side of the median separately. For each side of the
median, if a higher percentile than maxPOutliers is considered an outlier by the weight function
based on 9*mad(x), the width of the weight function is increased such that the percentile of
outliers on that side of the median equals maxPOutliers. Using maxPOutliers=1 will effectively
disable all weight function broadening; using maxPOutliers=0 will give results that are quite
similar (but not equal) to Pearson correlation.
real number between 0 and 1 that controls the handling of missing data in the calculation of correlations. See details.
Specifies whether the bicor calculation, if used, should revert to Pearson when the
median absolute deviation (mad) is zero. Recognized values are (abbreviations of)
"none", "individual", "all". If set to "none", zero mad will result in NA
for the corresponding correlation.
If set to "individual", Pearson calculation will be used only for columns that have zero mad.
If set to "all", the presence of a single zero mad will cause the whole variable to be treated in
Pearson correlation manner (as if the corresponding robust option was set to FALSE). Has no
effect for Pearson correlation. See bicor.
logical: should the cosine version of the correlation calculation be used? The cosine calculation differs from the standard one in that it does not subtract the mean.
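The difference between the two calculations can be seen in a small base R sketch. The functions pearson and cosine below are written out for illustration and are not the package's implementation.

```r
# Standard (Pearson) correlation: means are subtracted before normalizing.
pearson <- function(x, y) {
  xc <- x - mean(x)
  yc <- y - mean(y)
  sum(xc * yc) / sqrt(sum(xc^2) * sum(yc^2))
}

# Cosine version: the same formula without subtracting the mean.
cosine <- function(x, y) {
  sum(x * y) / sqrt(sum(x^2) * sum(y^2))
}

x <- c(1, 2, 3, 4, 5)
y <- c(2, 1, 4, 3, 6)

rPearson <- pearson(x, y)
rCosine <- cosine(x, y)

# On data that are already centered, the two versions coincide.
rCosineCentered <- cosine(x - mean(x), y - mean(y))
```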
soft-thresholding power for network construction.
network type. Allowed values are (unique abbreviations of) "unsigned",
"signed", "signed hybrid". See adjacency.
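To make the roles of power and networkType concrete, here is a minimal sketch of the usual soft-thresholding formulas for the three network types. The function softAdjacency is hypothetical; the formulas are written from the general WGCNA formalism and should be checked against the adjacency documentation.

```r
# Soft-thresholding adjacency from a correlation value r.
# Hypothetical helper for illustration; not the package's adjacency function.
softAdjacency <- function(r, power = 6,
                          networkType = c("unsigned", "signed", "signed hybrid")) {
  networkType <- match.arg(networkType)
  switch(networkType,
         "unsigned"      = abs(r)^power,              # sign of r is ignored
         "signed"        = ((1 + r) / 2)^power,       # negative r maps near 0
         "signed hybrid" = ifelse(r > 0, r^power, 0)) # negative r maps to 0
}

r <- c(-0.5, 0.5)
aUnsigned <- softAdjacency(r, power = 6, networkType = "unsigned")
aSigned   <- softAdjacency(r, power = 6, networkType = "signed")
aHybrid   <- softAdjacency(r, power = 6, networkType = "signed hybrid")
```

In the unsigned network, correlations of -0.5 and 0.5 receive the same adjacency, whereas the signed variants give the negative correlation a much smaller (or zero) adjacency.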
logical: should a basic sanity check be performed on the supplied power? If
you would like to experiment with unusual powers, set the argument to FALSE and proceed with
caution.
logical: should missing values in calculated adjacency be replaced by 0?
one of "none", "unsigned", "signed". If "none", adjacency
will be used for clustering. If "unsigned", the standard TOM will be used (more generally, TOM
function will receive the adjacency as input). If "signed", TOM will keep track of the sign of
correlations between neighbors. Note that the "unsigned" vs. "signed" distinction is only
relevant when networkType is "unsigned". When networkType is "signed" or
"signed hybrid", there is no difference between TOMType="signed" and TOMType="unsigned".
a character string specifying the TOM variant to be used. Recognized values are
"min", giving the standard TOM described in Zhang and Horvath (2005), and "mean", in which
the min function in the denominator is replaced by mean. The "mean" variant may produce
better results in certain special situations but at this time should be considered experimental.
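The effect of the denominator choice can be illustrated with a small base R sketch of the unsigned topological overlap. The function tomSketch is a hypothetical, unoptimized illustration of the formula from Zhang and Horvath (2005), not the package's compiled implementation.

```r
# Unsigned topological overlap from an adjacency matrix 'adj'.
# TOM_ij = (sum_u a_iu * a_uj + a_ij) / (f(k_i, k_j) + 1 - a_ij),
# where k_i is the connectivity of node i and f is min or mean.
tomSketch <- function(adj, TOMDenom = c("min", "mean")) {
  TOMDenom <- match.arg(TOMDenom)
  diag(adj) <- 0                     # exclude self-connections
  shared <- adj %*% adj              # shared-neighbor term, sum over u != i, j
  k <- rowSums(adj)                  # connectivity of each node
  kPair <- if (TOMDenom == "min") {
    outer(k, k, pmin)
  } else {
    outer(k, k, function(a, b) (a + b) / 2)
  }
  tom <- (shared + adj) / (kPair + 1 - adj)
  diag(tom) <- 1
  tom
}

# Simulated data: 20 samples, 6 genes, unsigned adjacency with power 6.
set.seed(2)
expr <- matrix(rnorm(20 * 6), nrow = 20, ncol = 6)
adj <- abs(cor(expr))^6

tomMin  <- tomSketch(adj, "min")
tomMean <- tomSketch(adj, "mean")
```

Because the mean of two connectivities is never smaller than their minimum, the "mean" variant in this sketch never yields a larger overlap than the "min" variant.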
logical: should calculated TOMs be saved to disk (TRUE) or returned in the return
value (FALSE)? Returning calculated TOMs via the return value may be more convenient but is not
always feasible if the matrices are too big to fit all in memory at the same time.
character string giving the file names to save individual TOMs into. The
following tags should be used to make the file names unique for each set and block: %s will be
replaced by the set number; %N will be replaced by the set name (taken from names(multiExpr))
if it exists, otherwise by set number; %b will be replaced by the block number. If the file names
turn out to be non-unique, an error will be generated.
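The tag substitution described above can be sketched in base R as follows. The helper expandTOMFileNames is hypothetical and only mimics the documented behavior of the %s, %N and %b tags.

```r
# Expand a file name template into one name per (set, block) combination.
expandTOMFileNames <- function(template, nSets, nBlocks, setNames = NULL) {
  # Fall back to set numbers when no set names are available.
  if (is.null(setNames)) setNames <- as.character(seq_len(nSets))
  files <- matrix("", nrow = nSets, ncol = nBlocks)
  for (s in seq_len(nSets)) {
    for (b in seq_len(nBlocks)) {
      f <- gsub("%s", as.character(s), template, fixed = TRUE)
      f <- gsub("%N", setNames[s], f, fixed = TRUE)
      f <- gsub("%b", as.character(b), f, fixed = TRUE)
      files[s, b] <- f
    }
  }
  # Non-unique names would make blocks overwrite each other on disk.
  if (anyDuplicated(as.vector(files)) > 0) stop("File names are not unique.")
  files
}

files <- expandTOMFileNames("individualTOM-Set%s-Block%b.RData",
                            nSets = 2, nBlocks = 3)
```

The result has sets in rows and blocks in columns. A template without any tags, such as "TOM.RData", would trigger the uniqueness error as soon as there is more than one set or block.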
non-negative integer specifying the number of parallel threads to be used by certain parts of correlation calculations. This option only has an effect on systems on which a POSIX thread library is available (which currently includes Linux and Mac OSX, but excludes Windows). If zero, the number of online processors will be used if it can be determined dynamically, otherwise correlation calculations will use 2 threads.
integer level of verbosity. Zero means silent, higher values make the output progressively more and more verbose.
indentation for diagnostic messages. Zero means no indentation, each unit adds two spaces.
A list with the following components:
Only returned if input saveTOMs is TRUE. A matrix of character
strings giving the file names in which each block TOM is saved. Rows correspond to data sets and columns to
blocks.
Only returned if input saveTOMs is FALSE. A list in which each
component corresponds to one block. Each component is a matrix of dimensions (N times (number of sets)), where
N is the length of a distance structure corresponding to the block. That is, if the block contains n genes,
N=n*(n-1)/2. Each column of the matrix contains the topological overlap of variables in the corresponding set
(and the corresponding block), arranged as a distance structure. Do note however that the topological overlap
is a similarity (not a distance).
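The distance-structure layout can be illustrated in base R: a symmetric n by n similarity matrix is stored as the N = n*(n-1)/2 entries of its lower triangle, in the same order used by dist objects. The similarity values below are made up for the example.

```r
n <- 4
N <- n * (n - 1) / 2   # length of the distance structure

# A made-up symmetric similarity matrix with unit diagonal.
sim <- 1 / (1 + abs(outer(seq_len(n), seq_len(n), "-")))

# Lower triangle, column by column, as stored in a 'dist' object.
v <- as.vector(as.dist(sim))

# With two sets, the block would be stored as an N x 2 matrix, one column per set.
blockTOM <- cbind(set1 = v, set2 = v)

# Rebuild the full symmetric matrix from one column.
rebuilt <- matrix(0, n, n)
rebuilt[lower.tri(rebuilt)] <- blockTOM[, "set1"]
rebuilt <- rebuilt + t(rebuilt)
diag(rebuilt) <- 1
```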
if input blocks was given, its copy; otherwise a vector of length equal to the number of
genes giving the block label for each gene. Note that block labels are not necessarily sorted in the
order in which the blocks were processed (since we do not require this for the input blocks). See
blockOrder below.
a list with one component for each block of genes. Each component is a vector giving
the indices (relative to the input multiExpr) of genes in the corresponding block.
if input checkMissingData is TRUE, the output of the function goodSamplesGenesMS.
A list with components goodGenes (logical vector indicating which genes passed the missing data
filters), goodSamples (a list of logical vectors indicating which samples passed the missing data
filters in each set), and allOK (a logical indicating whether all genes and all samples passed the
filters). See goodSamplesGenesMS for more details. If checkMissingData is FALSE,
goodSamplesAndGenes contains a list of the same type but indicating that all genes and all samples
passed the missing data filters.
The following components are present mostly to streamline the interaction of this function with blockwiseConsensusModules.
Number of genes that passed missing data filters (if input checkMissingData is TRUE),
or the number of all genes (if checkMissingData is FALSE).
the vector blocks (above), restricted to good genes only.
number of threads used to calculate correlation and TOM matrices.
logical: were calculated matrices saved in files (TRUE) or returned in the
return value (FALSE)?
integer codes for network and correlation type.
number of sets in input data.
the names attribute of input multiExpr.
The function starts by optionally filtering out samples that have too many missing entries and genes that have either too many missing entries or zero variance in at least one set. Genes that are filtered out are excluded from the TOM calculations.
If blocks is not given and the number of genes exceeds maxBlockSize, genes are pre-clustered
into blocks using the function consensusProjectiveKMeans; otherwise all genes are treated in a single block.
For each block of genes, the network is constructed and (if requested) topological overlap is calculated in each set. The topological overlaps can be saved to disk as RData files, or returned directly within the return value (see below). Note that the matrices can be big and returning them within the return value can quickly exhaust the system's memory. In particular, if the block-wise calculation is necessary, it is nearly certain that returning all matrices via the return value will be impossible.
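A minimal end-to-end sketch of the workflow described above is given below, using small simulated data so that everything fits in a single block and the overlaps can be returned in memory. The data, set names and the helper makeSet are hypothetical, and the call is only attempted if the WGCNA package is installed.

```r
# Simulated two-set data: 40 genes, with 30 and 25 samples respectively.
set.seed(1)
nGenes <- 40
makeSet <- function(nSamples) {
  list(data = matrix(rnorm(nSamples * nGenes), nrow = nSamples, ncol = nGenes))
}
multiExpr <- list(setA = makeSet(30), setB = makeSet(25))

if (requireNamespace("WGCNA", quietly = TRUE)) {
  suppressPackageStartupMessages(library(WGCNA))
  # With 40 genes and maxBlockSize = 5000, all genes form a single block.
  tomInfo <- blockwiseIndividualTOMs(
    multiExpr,
    maxBlockSize = 5000,
    power = 6,
    networkType = "unsigned",
    TOMType = "unsigned",
    saveTOMs = FALSE,   # return the overlaps instead of writing RData files
    verbose = 0)
  # Inspect the top-level components of the returned list.
  str(tomInfo, max.level = 1)
}
```

For data large enough to require several blocks, saveTOMs = TRUE is the safer choice, for the memory reasons given above.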
For a general discussion of the weighted network formalism, see
Bin Zhang and Steve Horvath (2005) "A General Framework for Weighted Gene Co-Expression Network Analysis", Statistical Applications in Genetics and Molecular Biology: Vol. 4: No. 1, Article 17
The blockwise approach is briefly described in the article describing this package,
Langfelder P, Horvath S (2008) "WGCNA: an R package for weighted correlation network analysis". BMC Bioinformatics 2008, 9:559