Quantify the preservation of network modules (sub-graphs) in an independent dataset through permutation testing on module topology. Seven network statistics (see details) are calculated for each module and then tested by comparing to distributions generated from their calculation on random subsets in the test dataset.
modulePreservation(
network,
data,
correlation,
moduleAssignments,
modules = NULL,
backgroundLabel = "0",
discovery = 1,
test = 2,
selfPreservation = FALSE,
nThreads = NULL,
nPerm = NULL,
null = "overlap",
alternative = "greater",
simplify = TRUE,
verbose = TRUE
)
A nested list structure. At the top level, the list has one element per
'discovery'
dataset. Each of these elements is a list that has one
element per 'test'
dataset analysed for that 'discovery'
dataset. Each of these elements is also a list, containing the following objects:
observed
:
A matrix of the observed values for the module preservation statistics.
Rows correspond to modules, and columns to the module preservation
statistics.
nulls
:
A three dimensional array containing the values of the module
preservation statistics evaluated on random permutation of module
assignment in the test network. Rows correspond to modules, columns to
the module preservation statistics, and the third dimension to the
permutations.
p.values
:
A matrix of p-values for the observed
module preservation
statistics as evaluated through a permutation test using the
corresponding values in nulls
.
nVarsPresent
:
A vector containing the number of variables that are present in the test
dataset for each module.
propVarsPresent
:
A vector containing the proportion of variables present in the test dataset
for each module. Modules where this is less than 1 should be
investigated further before making judgements about preservation to
ensure that the missing variables are not the most connected ones.
contingency
:
If moduleAssignments
are present for both the discovery
and test datasets, then a contingency table showing the overlap
between modules across datasets is returned. Rows correspond to modules
in the discovery dataset, columns to modules in the test
dataset.
When simplify = TRUE
then the simplest possible structure will be
returned. E.g. if module preservation is tested in only one dataset, then
the returned list will have only the above elements.
When simplify = FALSE
then a nested list of datasets will always be
returned, i.e. each element at the top level and second level correspond to
a dataset, e.g. results[["Dataset1"]][["Dataset2"]]
indicates an
analysis where modules discovered in "Dataset1" are assessed for
preservation in "Dataset2". Dataset comparisons which have not been
assessed will contain NULL
.
a list of interaction networks, one for each dataset. Each entry of the list should be a \(n * n\) matrix or where each element contains the edge weight between nodes \(i\) and \(j\) in the inferred network for that dataset.
a list of matrices, one for each dataset. Each entry of the list
should be the data used to infer the interaction network
for that
dataset. The columns should correspond to variables in the data
(nodes in the network) and rows to samples in that dataset.
a list of matrices, one for each dataset. Each entry of
the list should be a \(n * n\) matrix where each element contains the
correlation coefficient between nodes \(i\) and \(j\) in the
data
used to infer the interaction network for that dataset.
a list of vectors, one for each discovery dataset, containing the module assignments for each node in that dataset.
a list of vectors, one for each discovery
dataset,
of modules to perform the analysis on. If unspecified, all modules
in each discovery
dataset will be analysed, with the exception of
those specified in backgroundLabel
argument.
a single label given to nodes that do not belong to
any module in the moduleAssignments
argument. Defaults to "0". Set
to NULL
if you do not want to skip the network background module.
a vector of names or indices denoting the discovery
dataset(s) in the data
, correlation
, network
,
moduleAssignments
, modules
, and test
lists.
a list of vectors, one for each discovery
dataset,
of names or indices denoting the test dataset(s) in the data
,
correlation
, and network
lists.
logical; if FALSE
(default) then module
preservation analysis will not be performed within a dataset (i.e.
where the discovery
and test
datasets are the same).
number of threads to parallelise the calculation of network properties over. Automatically determined as the number of cores - 1 if not specified.
number of permutations to use. If not specified, the number of permutations will be automatically determined (see details). When set to 0 the permutation procedure will be skipped and the observed module preservation will be returned without p-values.
variables to include when generating the null distributions. Must be either "overlap" or "all" (see details).
The type of module preservation test to perform. Must be one of "greater" (default), "less" or "two.sided" (see details).
logical; if TRUE
, simplify the structure of the output
list if possible (see Return Value).
logical; should progress be reported? Default is TRUE
.
The preservation of network modules in a second dataset is quantified by
measuring the preservation of topological properties between the
discovery and test datasets. These properties are calculated
not only from the interaction networks inferred in each dataset, but also
from the data used to infer those networks (e.g. gene expression data) as
well as the correlation structure between variables/nodes. Thus, all
functions in the NetRep
package have the following arguments:
network
:
a list of interaction networks, one for each dataset.
data
:
a list of data matrices used to infer those networks, one for each
dataset.
correlation
:
a list of matrices containing the pairwise correlation coefficients
between variables/nodes in each dataset.
moduleAssignments
:
a list of vectors, one for each discovery dataset, containing
the module assignments for each node in that dataset.
modules
:
a list of vectors, one for each discovery dataset, containing
the names of the modules from that dataset to analyse.
discovery
:
a vector indicating the names or indices of the previous arguments'
lists to use as the discovery dataset(s) for the analyses.
test
:
a list of vectors, one vector for each discovery dataset,
containing the names or indices of the network
, data
, and
correlation
argument lists to use as the test dataset(s)
for the analysis of each discovery dataset.
The formatting of these arguments is not strict: each function will attempt
to make sense of the user input. For example, if there is only one
discovery
dataset, then input to the moduleAssigments
and
test
arguments may be vectors, rather than lists.
Matrices in the network
, data
, and correlation
lists
can be supplied as disk.matrix
objects. This class allows
matrix data to be kept on disk and loaded as required by NetRep.
This dramatically decreases memory usage: the matrices for only one
dataset will be kept in RAM at any point in time.
Additional memory usage of the permutation procedure is directly proportional to the sum of module sizes squared multiplied by the number of threads. Very large modules may result in significant additional memory usage per core due to extraction of the correlation coefficient sub-matrix at each permutation.
Module preservation is assessed through seven module preservation statistics, each of which captures a different aspect of a module's topology; i.e. the structure of the relationships between its nodes (1,2). Below is a description of each statistic, what they seek to measure, and where their interpretation may be inappropriate.
The module coherence ('coherence'
), average node
contribution ('avg.contrib'
), and concordance of node
contribution ('cor.contrib'
) are all calculated from the data used
to infer the network (provided in the 'data'
argument). They are
calculated from the module's summary profile. This is the eigenvector
of the 1st principal component across all observations for every node
composing the module. For gene coexpression modules this can be interpreted
as a "summary expression profile". It is typically referred to as the
"module eigengene" in the weighted gene coexpression network analysis
literature (4).
The module coherence ('coherence'
) quantifies the proportion
of module variance explained by the module's "summary profile". The higher
this value, the more "coherent" the data is, i.e. the more similar
the observations are nodes for each sample. With the default alternate
hypothesis, a small permutation P-value indicates that the module is
more coherent than expected by chance.
The average node contribution ('avg.contrib'
) and
concordance of node contribution ('cor.contrib'
) are calculated
from the node contribution, which quantifies how similar each node is
to the modules's summary profile. It is calculated as the Pearson
correlation coefficient between each node and the module summary profile. In
the weighted gene coexpression network literature it is typically called the
"module membership" (2).
The average node contribution ('avg.contrib'
) quantifies how
similar nodes are to the module summary profile in the test dataset. Nodes
detract from this score where the sign of their node contribution flips
between the discovery and test datasets, e.g. in the case of
differential gene expression across conditions. A high average node
contribution with a small permutation P-value indicates that the
module remains coherent in the test dataset, and that the nodes are acting
together in a similar way.
The concordance of node contribution ('cor.contrib'
) measures
whether the relative rank of nodes (in terms of their node contribution) is
preserved across datasets. If a module is coherent enough that all nodes
contribute strongly, then this statistic will not be meaningful as its value
will be heavily influenced by tiny variations in node rank. This can be
assessed through visualisation of the module topology (see
plotContribution
.) Similarly, a strong
'cor.contrib'
is unlikely to be meaningful if the
'avg.contrib'
is not significant.
The concordance of correlation strucutre ('cor.cor'
) and
density of correlation structure ('avg.cor'
) are calculated
from the user-provided correlation structure between nodes (provided in the
'correlation'
argument). This is referred to as "coexpression" when
calculated on gene expression data.
The 'avg.cor'
measures how strongly nodes within a module are
correlation on average in the test dataset. This average depends on the
correlation coefficients in the discovery dataset: the score is penalised
where correlation coefficients change in sign between datasets. A high
'avg.cor'
with a small permutation P-value indicates that the
module is (a) more strongly correlated than expected by chance for a module
of the same size, and (b) more consistently correlated with respect to the
discovery dataset than expected by chance.
The 'cor.cor'
measures how similar the correlation coefficients are
across the two datasets. A high 'cor.cor'
with a small permutation
P-value indicates that the correlation structure within a module is
more similar across datasets than expected by chance. If all nodes within a
module are very similarly correlated then this statistic will not be
meaningful, as its value will be heavily influenced by tiny, non-meaningful,
variations in correlation strength. This can be assessed through
visualisation of the module topology (see plotCorrelation
.)
Similarly, a strong 'cor.cor'
is unlikely to be meaningful if the
'avg.cor'
is not significant.
The average edge weight ('avg.weight'
) and concordance
of weighted degree ('cor.degree'
) are both calculated from the
interaction network (provided as adjacency matrices to the 'network'
argument).
The 'avg.weight'
measures the average connection strength between
nodes in the test dataset. In the weighted gene coexpression network
literature this is typically called the "module density" (2). A high
'avg.weight'
with a small permutation P-value indicates that
the module is more strongly connected in the test dataset than expected by
chance.
The 'cor.degree'
calculates whether the relative rank of each node's
weighted degree is similar across datasets. The weighted
degree is calculated as the sum of a node's edge weights to all other nodes
in the module. In the weighted gene coexpression network literature this is
typically called the "intramodular connectivity" (2). This statistic
will not be meaningful where all nodes are connected to each other with
similar strength, as its value will be heavily influenced by tiny,
non-meaningful, variations in weighted degree. This can be assessed through
visualisation of the module topology (see plotDegree
.)
Both the 'avg.weight'
and 'cor.degree'
assume edges are
weighted, and that the network is densely connected. Note that for sparse
networks, edges with zero weight are included when calculating both
statistics. Only the magnitude of the weights, not their sign, contribute to
the score. If the network is unweighted, i.e. edges indicate
presence or absence of a relationship, then the 'avg.weight'
will be
the proportion of the number of edges to the total number of possible edges
while the weighted degree simply becomes the degree. A high
'avg.weight'
in this case measures how interconnected a module is in
the test dataset. A high degree indicates that a node is connected to
many other nodes. The interpretation of the 'cor.degree'
remains
unchanged between weighted and unweighted networks. If the network is
directed the interpretation of the 'avg.weight'
remains unchanged,
while the cor.degree will measure the concordance of the node
in-degree in the test network. To measure the out-degree
instead, the adjacency matrices provided to the 'network'
argument
should be transposed.
Caution should be used when running NetRep
on sparse data (i.e. where there are many zero values in the data
used to infer the network). For this data, the average node contribution
('avg.contrib'
), concordance of node contribution
('cor.contrib'
), and module coherence ('coherence'
)
will all be systematically underestimated due to their reliance on the
Pearson correlation coefficient to calculate the node contribution.
Care should also be taken to use appropriate methods for inferring the correlation structure when the data is sparse for the same reason.
Caution should be used when running NetRep
on proportional data (
i.e. where observations across samples all sum to the same value,
e.g. 1). For this data, the average node contribution
('avg.contrib'
), concordance of node contribution
('cor.contrib'
), and module coherence ('coherence'
)
will all be systematically overestimated due to their reliance on the
Pearson correlation coefficient to calculate the node contribution.
Care should also be taken to use appropriate methods for inferring the correlation structure from proportional data for the same reason.
Three alternative hypotheses are available. "greater", the default, tests whether each module preservation statistic is larger than expected by chance. "lesser" tests whether each module preservation statistic is smaller than expected by chance, which may be useful for identifying modules that are extremely different in the test dataset. "two.sided" can be used to test both alternate hypotheses.
To determine whether a module preservation statistic deviates from chance, a
permutation procedure is employed. Each statistic is calculated between the
module in the discovery dataset and nPerm
random subsets of
the same size in the test dataset in order to assess the distribution
of each statistic under the null hypothesis.
Two models for the null hypothesis are available: "overlap", the default, only nodes that are present in both the discovery and test networks are used when generating null distributions. This is appropriate under an assumption that nodes that are present in the test dataset, but not present in the discovery dataset, are unobserved: that is, they may fall in the module(s) of interest in the discovery dataset if they were to be measured there. Alternatively, "all" will use all nodes in the test network when generating the null distributions.
The number of permutations required for any given significance threshold is
approximately 1 / the desired significance for one sided tests, and double
that for two-sided tests. This can be calculated with
requiredPerms
. When nPerm
is not specified, the number
of permutations is automatically calculated as the number required for a
Bonferroni corrected significance threshold adjusting for the total number
of tests for each statistic, i.e. the total number of modules to be analysed
multiplied by the number of test datasets each module is tested in.
Although assessing the replication of a small numberof modules calls for
very few permutations, we recommend using no fewer than 1,000 as fewer
permutations are unlikely to generate representative null distributions.
Note: the assumption used by requiredPerms
to
determine the correct number of permtutations breaks down when assessing the
preservation of modules in a very small dataset (e.g. gene sets in a dataset
with less than 100 genes total). However, the reported p-values will still
be accurate (see permutationTest
) (3).
Ritchie, S.C., et al., A scalable permutation approach reveals replication and preservation patterns of network modules in large datasets. Cell Systems. 3, 71-82 (2016).
Langfelder, P., Luo, R., Oldham, M. C. & Horvath, S. Is my network module preserved and reproducible? PLoS Comput. Biol. 7, e1001057 (2011).
Phipson, B. & Smyth, G. K. Permutation P-values should never be zero: calculating exact P-values when permutations are randomly drawn. Stat. Appl. Genet. Mol. Biol. 9, Article39 (2010).
Langfelder, P. & Horvath, S. WGCNA: an R package for weighted correlation network analysis. BMC Bioinformatics 9, 559 (2008).
Functions for: visualising network modules, calculating module topology, calculating permutation test P-values, and splitting computation over multiple machines.
# load in example data, correlation, and network matrices for a discovery and test dataset:
data("NetRep")
# Set up input lists for each input matrix type across datasets. The list
# elements can have any names, so long as they are consistent between the
# inputs.
network_list <- list(discovery=discovery_network, test=test_network)
data_list <- list(discovery=discovery_data, test=test_data)
correlation_list <- list(discovery=discovery_correlation, test=test_correlation)
labels_list <- list(discovery=module_labels)
# Assess module preservation: you should run at least 10,000 permutations
preservation <- modulePreservation(
network=network_list, data=data_list, correlation=correlation_list,
moduleAssignments=labels_list, nPerm=1000, discovery="discovery",
test="test", nThreads=2
)
Run the code above in your browser using DataLab