This function calculates correlation network matrices (adjacencies or topological overlaps), after optionally first pre-clustering input data into blocks.
individualTOMs(
  multiExpr,
  multiWeights = NULL,
  multiExpr.imputed = NULL,

  # Data checking options
  checkMissingData = TRUE,

  # Blocking options
  blocks = NULL,
  maxBlockSize = 5000,
  blockSizePenaltyPower = 5,
  nPreclusteringCenters = NULL,
  randomSeed = 54321,

  # Network construction options
  networkOptions,

  # Save individual TOMs?
  saveTOMs = TRUE,
  individualTOMFileNames = "individualTOM-Set%s-Block%b.RData",

  # Behaviour options
  collectGarbage = TRUE,
  verbose = 2, indent = 0)
A list with the following components:
A multiData structure containing (possibly blockwise) network matrices for each input data set. The network matrices are stored as BlockwiseData objects.
A copy of names(multiExpr).
Number of sets in multiExpr.
A list of class BlockInformation, giving information about blocks and gene and sample filtering.
The input networkOptions, returned as a multiData structure with one entry per input data set.
multiExpr
expression data in the multi-set format (see checkSets). A vector of lists, one per set. Each set must contain a component data that contains the expression data, with rows corresponding to samples and columns to genes or probes.
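As an illustration of the multi-set format described above, the following base R sketch builds a two-set structure by hand. The set names, dimensions, helper name makeSet, and simulated values are hypothetical; the sketch also assumes that all sets share the same genes in the same order.

```r
# Simulate two hypothetical data sets measured on the same 50 genes.
set.seed(1)
nGenes <- 50
geneNames <- paste0("Gene", seq_len(nGenes))

# Hypothetical helper: one set is a list with a component `data`
# (rows = samples, columns = genes or probes).
makeSet <- function(nSamples) {
  x <- matrix(rnorm(nSamples * nGenes), nrow = nSamples, ncol = nGenes)
  colnames(x) <- geneNames
  list(data = x)
}

# One list per set; the number of samples may differ between sets.
multiExpr <- list(SetA = makeSet(20), SetB = makeSet(30))

# Dimensions (samples, genes) of each set.
sapply(multiExpr, function(set) dim(set$data))
```

A structure built this way can be passed through checkSets before use to confirm that it is consistent.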
multiWeights
optional observation weights in the same format (and dimensions) as multiExpr. These weights are used for correlation calculations with data in multiExpr.
multiExpr.imputed
Optional version of multiExpr with missing data imputed. If not given and multiExpr contains missing data, they will be imputed using the function impute.knn.
checkMissingData
logical: should data be checked for excessive numbers of missing entries in genes and samples, and for genes with zero variance? See details.
blocks
optional specification of blocks in which hierarchical clustering and module detection should be performed. If given, must be a numeric vector with one entry per gene of multiExpr giving the number of the block to which the corresponding gene belongs.
maxBlockSize
integer giving the maximum block size for module detection. Ignored if blocks above is non-NULL. Otherwise, if the number of genes in multiExpr exceeds maxBlockSize, genes will be pre-clustered into blocks whose size should not exceed maxBlockSize.
blockSizePenaltyPower
number specifying how strongly blocks should be penalized for exceeding the maximum size. Set to a large number or Inf if staying within the maximum block size is very important.
nPreclusteringCenters
number of centers to be used in the preclustering. Defaults to the smaller of nGenes/20 and 100*nGenes/maxBlockSize, where nGenes is the number of genes (variables) in multiExpr.
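The default stated above can be worked through with a small base R sketch. The helper name defaultPreclusteringCenters is hypothetical, and truncating the result to a whole number is an assumption of this sketch rather than something the documentation specifies.

```r
# Hypothetical helper reproducing the documented default:
# the smaller of nGenes/20 and 100*nGenes/maxBlockSize.
defaultPreclusteringCenters <- function(nGenes, maxBlockSize = 5000) {
  # Truncation to an integer is an assumption of this sketch.
  as.integer(min(nGenes / 20, 100 * nGenes / maxBlockSize))
}

# With the default maxBlockSize, the second term is the smaller one:
# min(20000/20, 100*20000/5000) = min(1000, 400)
defaultPreclusteringCenters(20000)

# With a small maxBlockSize, nGenes/20 is the smaller one:
# min(20000/20, 100*20000/500) = min(1000, 4000)
defaultPreclusteringCenters(20000, maxBlockSize = 500)
```

In this formula nGenes/20 is the smaller term whenever maxBlockSize is below 2000; for larger block sizes the second term determines the default.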
randomSeed
integer to be used as seed for the random number generator before the function starts. If a current seed exists, it is saved and restored upon exit. If NULL is given, the function will not save and restore the seed.
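The save-and-restore behaviour described above follows a common base R pattern, sketched below. The helper name withSeed is hypothetical and this is not the package's own code; it only illustrates how a caller's random number stream can be left undisturbed.

```r
# Hypothetical helper: evaluate `expr` under a fixed seed, then restore
# the caller's seed if one existed. With seed = NULL nothing is saved,
# set, or restored.
withSeed <- function(seed, expr) {
  if (!is.null(seed)) {
    hadSeed <- exists(".Random.seed", envir = .GlobalEnv, inherits = FALSE)
    if (hadSeed) {
      oldSeed <- get(".Random.seed", envir = .GlobalEnv)
      # Restore the saved state when this function exits.
      on.exit(assign(".Random.seed", oldSeed, envir = .GlobalEnv))
    }
    set.seed(seed)
  }
  # `expr` is evaluated lazily, so it runs here, after set.seed().
  expr
}

set.seed(1)
reproducible <- withSeed(54321, runif(3))  # same values on every call
afterwards <- runif(1)                     # same as the first draw after set.seed(1)
```

Because the previous state is restored on exit, random draws made after the call continue as if the seeded computation had never happened.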
networkOptions
A single list of class NetworkOptions giving options for network calculation for all of the networks, or a multiData structure containing one such list for each input data set.
saveTOMs
logical: should individual TOMs be saved to disk (TRUE) or returned directly in the return value (FALSE)?
individualTOMFileNames
character string giving the file names to save individual TOMs into. The following tags should be used to make the file names unique for each set and block: %s will be replaced by the set number; %N will be replaced by the set name (taken from names(multiExpr)) if it exists, otherwise by set number; %b will be replaced by the block number. If the file names turn out to be non-unique, an error will be generated.
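The tag substitution described above can be sketched in base R as follows. The helper name expandTOMFileNames and its arguments are hypothetical; the sketch only mimics the documented rules for %s, %N and %b and the uniqueness check, and is not the package's implementation.

```r
# Hypothetical helper: expand %s (set number), %N (set name, falling back
# to the set number) and %b (block number) for every set/block pair.
expandTOMFileNames <- function(template, nSets, nBlocks, setNames = NULL) {
  grid <- expand.grid(set = seq_len(nSets), block = seq_len(nBlocks))
  files <- mapply(function(set, block) {
    setName <- if (is.null(setNames)) as.character(set) else setNames[set]
    out <- gsub("%s", as.character(set), template, fixed = TRUE)
    out <- gsub("%N", setName, out, fixed = TRUE)
    gsub("%b", as.character(block), out, fixed = TRUE)
  }, grid$set, grid$block)
  # Mirror the documented behaviour: non-unique names are an error.
  if (anyDuplicated(files) > 0) stop("File names are not unique.")
  unname(files)
}

# Default template: one file per set and block.
expandTOMFileNames("individualTOM-Set%s-Block%b.RData", nSets = 2, nBlocks = 2)

# Using set names instead of set numbers.
expandTOMFileNames("TOM-%N-Block%b.RData", nSets = 2, nBlocks = 1,
                   setNames = c("Female", "Male"))
```

A template that omits the set tag, such as "TOM-Block%b.RData" with more than one set, produces duplicate names and therefore triggers the error in this sketch.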
collectGarbage
Logical: should garbage collection be called after each block calculation? This can be useful when the data are large, but could unnecessarily slow down calculation with small data.
verbose
Integer level of verbosity. Zero means silent; higher values make the output progressively more verbose.
indent
Indentation for diagnostic messages. Zero means no indentation; each unit adds two spaces.
Peter Langfelder
The function starts by optionally filtering out samples that have too many missing entries and genes that have either too many missing entries or zero variance in at least one set. Genes that are filtered out are excluded from the network calculations.
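The gene-filtering rule in the paragraph above can be sketched in base R. This is an illustration only: the helper name keepGenes and the 50% missing-data threshold are assumptions (the actual thresholds are not stated here), and sample filtering is omitted.

```r
# Hypothetical helper: a gene is kept only if, in every set, it has an
# acceptable fraction of missing values and non-zero variance.
keepGenes <- function(multiExpr, maxMissingFraction = 0.5) {
  okPerSet <- lapply(multiExpr, function(set) {
    x <- set$data
    fracMissing <- colMeans(is.na(x))
    v <- apply(x, 2, var, na.rm = TRUE)
    fracMissing <= maxMissingFraction & !is.na(v) & v > 0
  })
  # Failing in at least one set is enough to exclude a gene.
  Reduce(`&`, okPerSet)
}

# Toy data: g2 is constant in the second set, g3 is mostly missing in the first.
set1 <- list(data = cbind(g1 = c(1, 2, 3, 4), g2 = c(2, 1, 4, 3), g3 = c(NA, NA, NA, 1)))
set2 <- list(data = cbind(g1 = c(4, 3, 2, 1), g2 = c(5, 5, 5, 5), g3 = c(1, 2, 3, 4)))

keep <- keepGenes(list(set1, set2))
keep   # only g1 passes in both sets
```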
If blocks is not given and the number of genes (columns) in multiExpr exceeds maxBlockSize, genes are pre-clustered into blocks using the function consensusProjectiveKMeans; otherwise all genes are treated in a single block. Any missing data in multiExpr will be imputed; if imputed data are already available, they can be supplied separately.
For each block of genes, the network adjacency is constructed and (if requested) topological overlap is calculated in each set. The topological overlaps can be saved to disk as RData files, or returned directly within the return value (see below). Note that the matrices can be big and returning them within the return value can quickly exhaust the system's memory. In particular, if the block-wise calculation is necessary, it is usually impossible to return all matrices in the return value.
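To give a sense of scale for the memory warning above, the following base R sketch estimates the size of the matrices. It assumes each block is stored as a dense square matrix of double-precision (8-byte) values; the helper name tomMemoryGB is hypothetical, and the actual storage used by BlockwiseData objects may differ.

```r
# Hypothetical helper: approximate memory (in GB) needed to hold one dense
# block-size x block-size matrix per block and per set.
tomMemoryGB <- function(blockSizes, nSets, bytesPerEntry = 8) {
  nSets * sum(as.numeric(blockSizes)^2) * bytesPerEntry / 1e9
}

# 20000 genes split into 4 blocks of 5000, 3 sets:
# 3 * 4 * 5000^2 * 8 bytes = 2.4 GB
tomMemoryGB(rep(5000, 4), nSets = 3)

# The same 20000 genes in a single block, 3 sets:
# 3 * 20000^2 * 8 bytes = 9.6 GB
tomMemoryGB(20000, nSets = 3)
```

Under these assumptions, even the blockwise case requires gigabytes when all matrices are held at once, which is why saving them to disk is the default.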
Input arguments and output components of this function use multiData, NetworkOptions, BlockwiseData, and BlockInformation.
Underlying functions of interest include consensusProjectiveKMeans and TOMsimilarityFromExpr.