
NetRep (version 1.2.7)

modulePreservation: Replication and preservation of network modules across datasets

Description

Quantify the preservation of network modules (sub-graphs) in an independent dataset through permutation testing on module topology. Seven network statistics (see Details) are calculated for each module and then tested by comparing them to null distributions generated by calculating the same statistics on random subsets of the test dataset.

Usage

modulePreservation(
  network,
  data,
  correlation,
  moduleAssignments,
  modules = NULL,
  backgroundLabel = "0",
  discovery = 1,
  test = 2,
  selfPreservation = FALSE,
  nThreads = NULL,
  nPerm = NULL,
  null = "overlap",
  alternative = "greater",
  simplify = TRUE,
  verbose = TRUE
)

Value

A nested list structure. At the top level, the list has one element per 'discovery' dataset. Each of these elements is a list that has one element per 'test' dataset analysed for that 'discovery' dataset. Each of these elements is also a list, containing the following objects:

  • observed: A matrix of the observed values for the module preservation statistics. Rows correspond to modules, and columns to the module preservation statistics.

  • nulls: A three-dimensional array containing the values of the module preservation statistics evaluated on random permutations of the module assignments in the test network. Rows correspond to modules, columns to the module preservation statistics, and the third dimension to the permutations.

  • p.values: A matrix of p-values for the observed module preservation statistics as evaluated through a permutation test using the corresponding values in nulls.

  • nVarsPresent: A vector containing the number of variables that are present in the test dataset for each module.

  • propVarsPresent: A vector containing the proportion of variables present in the test dataset for each module. Modules where this is less than 1 should be investigated further before making judgements about preservation to ensure that the missing variables are not the most connected ones.

  • contingency: If moduleAssignments are present for both the discovery and test datasets, then a contingency table showing the overlap between modules across datasets is returned. Rows correspond to modules in the discovery dataset, columns to modules in the test dataset.

When simplify = TRUE, the simplest possible structure is returned. For example, if module preservation is tested in only one test dataset for a single discovery dataset, the returned list contains only the elements above.

When simplify = FALSE, a nested list of datasets is always returned, i.e. each element at the top level and at the second level corresponds to a dataset, e.g. results[["Dataset1"]][["Dataset2"]] contains the analysis where modules discovered in "Dataset1" are assessed for preservation in "Dataset2". Dataset comparisons that have not been assessed will contain NULL.
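For instance, assuming the analysis shown in the Examples section below (a single discovery and test dataset, with simplify left at its default of TRUE), the main elements can be inspected directly; with simplify = FALSE the same elements would sit one level deeper, e.g. preservation[["discovery"]][["test"]]$p.values. A minimal sketch using the object names from that example:

# Hedged sketch: inspect the simplified results of the example analysis.
preservation$observed         # observed statistics, one row per module
preservation$p.values         # permutation p-values, same dimensions
preservation$propVarsPresent  # proportion of each module present in the test data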

Arguments

network

a list of interaction networks, one for each dataset. Each entry of the list should be a \(n * n\) matrix where each element contains the edge weight between nodes \(i\) and \(j\) in the inferred network for that dataset.

data

a list of matrices, one for each dataset. Each entry of the list should be the data used to infer the interaction network for that dataset. The columns should correspond to variables in the data (nodes in the network) and rows to samples in that dataset.

correlation

a list of matrices, one for each dataset. Each entry of the list should be a \(n * n\) matrix where each element contains the correlation coefficient between nodes \(i\) and \(j\) in the data used to infer the interaction network for that dataset.

moduleAssignments

a list of vectors, one for each discovery dataset, containing the module assignments for each node in that dataset.

modules

a list of vectors, one for each discovery dataset, of modules to perform the analysis on. If unspecified, all modules in each discovery dataset will be analysed, with the exception of those specified in the backgroundLabel argument.

backgroundLabel

a single label given to nodes that do not belong to any module in the moduleAssignments argument. Defaults to "0". Set to NULL if you do not want to skip the network background module.

discovery

a vector of names or indices denoting the discovery dataset(s) in the data, correlation, network, moduleAssignments, modules, and test lists.

test

a list of vectors, one for each discovery dataset, of names or indices denoting the test dataset(s) in the data, correlation, and network lists.

selfPreservation

logical; if FALSE (default) then module preservation analysis will not be performed within a dataset (i.e. where the discovery and test datasets are the same).

nThreads

number of threads to parallelise the calculation of network properties over. Automatically determined as the number of cores - 1 if not specified.

nPerm

number of permutations to use. If not specified, the number of permutations will be automatically determined (see details). When set to 0 the permutation procedure will be skipped and the observed module preservation will be returned without p-values.

null

variables to include when generating the null distributions. Must be either "overlap" or "all" (see details).

alternative

The type of module preservation test to perform. Must be one of "greater" (default), "less" or "two.sided" (see details).

simplify

logical; if TRUE, simplify the structure of the output list if possible (see Value).

verbose

logical; should progress be reported? Default is TRUE.

Details

Input data structures:

The preservation of network modules in a second dataset is quantified by measuring the preservation of topological properties between the discovery and test datasets. These properties are calculated not only from the interaction networks inferred in each dataset, but also from the data used to infer those networks (e.g. gene expression data) as well as the correlation structure between variables/nodes. Thus, all functions in the NetRep package have the following arguments:

  • network: a list of interaction networks, one for each dataset.

  • data: a list of data matrices used to infer those networks, one for each dataset.

  • correlation: a list of matrices containing the pairwise correlation coefficients between variables/nodes in each dataset.

  • moduleAssignments: a list of vectors, one for each discovery dataset, containing the module assignments for each node in that dataset.

  • modules: a list of vectors, one for each discovery dataset, containing the names of the modules from that dataset to analyse.

  • discovery: a vector indicating the names or indices of the previous arguments' lists to use as the discovery dataset(s) for the analyses.

  • test: a list of vectors, one vector for each discovery dataset, containing the names or indices of the network, data, and correlation argument lists to use as the test dataset(s) for the analysis of each discovery dataset.

The formatting of these arguments is not strict: each function will attempt to make sense of the user input. For example, if there is only one discovery dataset, then input to the moduleAssignments and test arguments may be vectors, rather than lists.
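For instance, an analysis with a single discovery dataset could be specified without wrapping the single-discovery inputs in lists. A minimal sketch, assuming the objects loaded by data("NetRep") in the Examples section below:

# Hedged sketch: with one discovery dataset, moduleAssignments and test can
# be supplied directly rather than as lists.
preservation <- modulePreservation(
  network = list(discovery = discovery_network, test = test_network),
  data = list(discovery = discovery_data, test = test_data),
  correlation = list(discovery = discovery_correlation, test = test_correlation),
  moduleAssignments = module_labels,   # a vector, not a list
  discovery = "discovery", test = "test", nPerm = 1000
)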

Analysing large datasets:

Matrices in the network, data, and correlation lists can be supplied as disk.matrix objects. This class allows matrix data to be kept on disk and loaded as required by NetRep. This dramatically decreases memory usage: the matrices for only one dataset will be kept in RAM at any point in time.

Additional memory usage of the permutation procedure is directly proportional to the sum of module sizes squared multiplied by the number of threads. Very large modules may result in significant additional memory usage per core due to extraction of the correlation coefficient sub-matrix at each permutation.
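A minimal sketch of this approach, assuming the example matrices from data("NetRep") and illustrative file paths (see the disk.matrix help page for the exact interface):

# Hedged sketch: store matrices on disk and pass disk.matrix objects to
# modulePreservation in place of in-memory matrices.
discovery_network_dm <- as.disk.matrix(discovery_network, file = "discovery_network.rds")
test_network_dm <- as.disk.matrix(test_network, file = "test_network.rds")
network_list <- list(discovery = discovery_network_dm, test = test_network_dm)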

Module Preservation Statistics:

Module preservation is assessed through seven module preservation statistics, each of which captures a different aspect of a module's topology; i.e. the structure of the relationships between its nodes (1,2). Below is a description of each statistic, what they seek to measure, and where their interpretation may be inappropriate.

The module coherence ('coherence'), average node contribution ('avg.contrib'), and concordance of node contribution ('cor.contrib') are all calculated from the data used to infer the network (provided in the 'data' argument). They are calculated from the module's summary profile: the first principal component taken across all observations for the nodes composing the module. For gene coexpression modules this can be interpreted as a "summary expression profile". It is typically referred to as the "module eigengene" in the weighted gene coexpression network analysis literature (4).

The module coherence ('coherence') quantifies the proportion of module variance explained by the module's "summary profile". The higher this value, the more "coherent" the data is, i.e. the more similar the nodes' observations are across samples. With the default alternative hypothesis, a small permutation P-value indicates that the module is more coherent than expected by chance.

The average node contribution ('avg.contrib') and concordance of node contribution ('cor.contrib') are calculated from the node contribution, which quantifies how similar each node is to the module's summary profile. It is calculated as the Pearson correlation coefficient between each node and the module summary profile. In the weighted gene coexpression network literature it is typically called the "module membership" (2).

The average node contribution ('avg.contrib') quantifies how similar nodes are to the module summary profile in the test dataset. Nodes detract from this score where the sign of their node contribution flips between the discovery and test datasets, e.g. in the case of differential gene expression across conditions. A high average node contribution with a small permutation P-value indicates that the module remains coherent in the test dataset, and that the nodes are acting together in a similar way.

The concordance of node contribution ('cor.contrib') measures whether the relative rank of nodes (in terms of their node contribution) is preserved across datasets. If a module is coherent enough that all nodes contribute strongly, then this statistic will not be meaningful as its value will be heavily influenced by tiny variations in node rank. This can be assessed through visualisation of the module topology (see plotContribution.) Similarly, a strong 'cor.contrib' is unlikely to be meaningful if the 'avg.contrib' is not significant.
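The following is a conceptual sketch of these quantities for a single module, using base R and the example data from data("NetRep") (the module label "1" and the samples-by-nodes data layout are assumptions; this illustrates the definitions above and is not NetRep's internal implementation, which additionally handles signs relative to the discovery dataset when scoring the test dataset):

# Hedged conceptual sketch of the summary profile, module coherence, and
# node contribution for one module.
mod_nodes <- names(module_labels)[module_labels == "1"]
mod_data  <- scale(discovery_data[, mod_nodes])        # samples x nodes, standardised
pc <- prcomp(mod_data, center = FALSE, scale. = FALSE)
summary_profile   <- pc$x[, 1]                         # one value per sample
coherence         <- pc$sdev[1]^2 / sum(pc$sdev^2)     # variance explained by the summary profile
node_contribution <- cor(mod_data, summary_profile)    # Pearson correlation, one value per node
avg_contrib       <- mean(node_contribution)           # in practice calculated in the test dataset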

The concordance of correlation structure ('cor.cor') and density of correlation structure ('avg.cor') are calculated from the user-provided correlation structure between nodes (provided in the 'correlation' argument). This is referred to as "coexpression" when calculated on gene expression data.

The 'avg.cor' measures how strongly nodes within a module are correlated on average in the test dataset. This average depends on the correlation coefficients in the discovery dataset: the score is penalised where correlation coefficients change sign between datasets. A high 'avg.cor' with a small permutation P-value indicates that the module is (a) more strongly correlated than expected by chance for a module of the same size, and (b) more consistently correlated with respect to the discovery dataset than expected by chance.

The 'cor.cor' measures how similar the correlation coefficients are across the two datasets. A high 'cor.cor' with a small permutation P-value indicates that the correlation structure within a module is more similar across datasets than expected by chance. If all nodes within a module are very similarly correlated then this statistic will not be meaningful, as its value will be heavily influenced by tiny, non-meaningful, variations in correlation strength. This can be assessed through visualisation of the module topology (see plotCorrelation.) Similarly, a strong 'cor.cor' is unlikely to be meaningful if the 'avg.cor' is not significant.
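A conceptual sketch of plausible formulations of 'avg.cor' and 'cor.cor' consistent with the descriptions above (illustrative only, not NetRep's exact implementation; assumes data("NetRep") is loaded and the module label "1" is illustrative):

# Hedged conceptual sketch of the correlation-structure statistics for one module.
mod_nodes <- names(module_labels)[module_labels == "1"]
disc_cor  <- discovery_correlation[mod_nodes, mod_nodes]
test_cor  <- test_correlation[mod_nodes, mod_nodes]
lt <- lower.tri(disc_cor)                            # each node pair counted once
avg_cor <- mean(sign(disc_cor[lt]) * test_cor[lt])   # penalises sign flips between datasets
cor_cor <- cor(disc_cor[lt], test_cor[lt])           # similarity of correlation structure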

The average edge weight ('avg.weight') and concordance of weighted degree ('cor.degree') are both calculated from the interaction network (provided as adjacency matrices to the 'network' argument).

The 'avg.weight' measures the average connection strength between nodes in the test dataset. In the weighted gene coexpression network literature this is typically called the "module density" (2). A high 'avg.weight' with a small permutation P-value indicates that the module is more strongly connected in the test dataset than expected by chance.

The 'cor.degree' calculates whether the relative rank of each node's weighted degree is similar across datasets. The weighted degree is calculated as the sum of a node's edge weights to all other nodes in the module. In the weighted gene coexpression network literature this is typically called the "intramodular connectivity" (2). This statistic will not be meaningful where all nodes are connected to each other with similar strength, as its value will be heavily influenced by tiny, non-meaningful, variations in weighted degree. This can be assessed through visualisation of the module topology (see plotDegree.)

Both the 'avg.weight' and 'cor.degree' assume edges are weighted, and that the network is densely connected. Note that for sparse networks, edges with zero weight are included when calculating both statistics. Only the magnitude of the weights, not their sign, contributes to the score. If the network is unweighted, i.e. edges indicate the presence or absence of a relationship, then the 'avg.weight' will be the proportion of observed edges out of the total number of possible edges, while the weighted degree simply becomes the degree. A high 'avg.weight' in this case measures how interconnected a module is in the test dataset. A high degree indicates that a node is connected to many other nodes. The interpretation of the 'cor.degree' remains unchanged between weighted and unweighted networks. If the network is directed, the interpretation of the 'avg.weight' remains unchanged, while the 'cor.degree' will measure the concordance of the node in-degree in the test network. To measure the out-degree instead, the adjacency matrices provided to the 'network' argument should be transposed.
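A corresponding conceptual sketch of 'avg.weight' and 'cor.degree' for one module (illustrative only, not NetRep's exact implementation; assumes data("NetRep") is loaded and the module label "1" is illustrative):

# Hedged conceptual sketch of the network statistics for one module.
mod_nodes <- names(module_labels)[module_labels == "1"]
disc_net  <- discovery_network[mod_nodes, mod_nodes]
test_net  <- test_network[mod_nodes, mod_nodes]
avg_weight  <- mean(abs(test_net[lower.tri(test_net)]))  # module density in the test network
disc_degree <- rowSums(disc_net) - diag(disc_net)        # weighted degree within the module
test_degree <- rowSums(test_net) - diag(test_net)
cor_degree  <- cor(disc_degree, test_degree)             # concordance of weighted degree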

Sparse data:

Caution should be used when running NetRep on sparse data (i.e. where there are many zero values in the data used to infer the network). For this data, the average node contribution ('avg.contrib'), concordance of node contribution ('cor.contrib'), and module coherence ('coherence') will all be systematically underestimated due to their reliance on the Pearson correlation coefficient to calculate the node contribution.

Care should also be taken to use appropriate methods for inferring the correlation structure when the data is sparse for the same reason.

Proportional data:

Caution should be used when running NetRep on proportional data (i.e. where observations across samples all sum to the same value, e.g. 1). For this data, the average node contribution ('avg.contrib'), concordance of node contribution ('cor.contrib'), and module coherence ('coherence') will all be systematically overestimated due to their reliance on the Pearson correlation coefficient to calculate the node contribution.

Care should also be taken to use appropriate methods for inferring the correlation structure from proportional data for the same reason.

Hypothesis testing:

Three alternative hypotheses are available. "greater", the default, tests whether each module preservation statistic is larger than expected by chance. "less" tests whether each module preservation statistic is smaller than expected by chance, which may be useful for identifying modules that are extremely different in the test dataset. "two.sided" can be used to test both alternative hypotheses.

To determine whether a module preservation statistic deviates from chance, a permutation procedure is employed. Each statistic is calculated between the module in the discovery dataset and nPerm random subsets of the same size in the test dataset in order to assess the distribution of each statistic under the null hypothesis.

Two models for the null hypothesis are available. Under "overlap", the default, only nodes that are present in both the discovery and test networks are used when generating null distributions. This is appropriate under the assumption that nodes present in the test dataset, but not in the discovery dataset, are unobserved: that is, they may have fallen in the module(s) of interest in the discovery dataset had they been measured there. Alternatively, "all" will use all nodes in the test network when generating the null distributions.
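For example, the null model can be changed through the 'null' argument (a sketch using the input lists constructed in the Examples section below):

# Hedged sketch: use all nodes in the test network for the null distributions.
preservation_all <- modulePreservation(
  network = network_list, data = data_list, correlation = correlation_list,
  moduleAssignments = labels_list, discovery = "discovery", test = "test",
  nPerm = 1000, null = "all"
)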

The number of permutations required for any given significance threshold is approximately 1 / the desired significance for one-sided tests, and double that for two-sided tests. This can be calculated with requiredPerms. When nPerm is not specified, the number of permutations is automatically calculated as the number required for a Bonferroni-corrected significance threshold adjusting for the total number of tests for each statistic, i.e. the total number of modules to be analysed multiplied by the number of test datasets each module is tested in. Although assessing the replication of a small number of modules calls for very few permutations, we recommend using no fewer than 1,000, as fewer permutations are unlikely to generate representative null distributions. Note: the assumption used by requiredPerms to determine the correct number of permutations breaks down when assessing the preservation of modules in a very small dataset (e.g. gene sets in a dataset with fewer than 100 genes in total). However, the reported p-values will still be accurate (see permutationTest) (3).
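For example, requiredPerms can be used to check the number of permutations needed for a Bonferroni-corrected threshold (the module and dataset counts below are illustrative):

# Hedged sketch: permutations required when testing 4 modules in 1 test
# dataset at a significance threshold of 0.05.
nModules <- 4
nTestDatasets <- 1
requiredPerms(0.05 / (nModules * nTestDatasets))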

References

  1. Ritchie, S.C., et al., A scalable permutation approach reveals replication and preservation patterns of network modules in large datasets. Cell Systems. 3, 71-82 (2016).

  2. Langfelder, P., Luo, R., Oldham, M. C. & Horvath, S. Is my network module preserved and reproducible? PLoS Comput. Biol. 7, e1001057 (2011).

  3. Phipson, B. & Smyth, G. K. Permutation P-values should never be zero: calculating exact P-values when permutations are randomly drawn. Stat. Appl. Genet. Mol. Biol. 9, Article39 (2010).

  4. Langfelder, P. & Horvath, S. WGCNA: an R package for weighted correlation network analysis. BMC Bioinformatics 9, 559 (2008).

See Also

Functions for: visualising network modules, calculating module topology, calculating permutation test P-values, and splitting computation over multiple machines.

Examples

# load in example data, correlation, and network matrices for a discovery and test dataset:
data("NetRep")

# Set up input lists for each input matrix type across datasets. The list
# elements can have any names, so long as they are consistent between the
# inputs.
network_list <- list(discovery=discovery_network, test=test_network)
data_list <- list(discovery=discovery_data, test=test_data)
correlation_list <- list(discovery=discovery_correlation, test=test_correlation)
labels_list <- list(discovery=module_labels)

# Assess module preservation: at least 10,000 permutations are recommended
# in practice (only 1,000 are used here so the example runs quickly)
preservation <- modulePreservation(
 network=network_list, data=data_list, correlation=correlation_list, 
 moduleAssignments=labels_list, nPerm=1000, discovery="discovery", 
 test="test", nThreads=2
)
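
# The observed statistics and permutation p-values can then be inspected.
# A conservative summary (illustrative, not required by the package) is to
# take the largest p-value across the seven statistics for each module:
preservation$observed
preservation$p.values
apply(preservation$p.values, 1, max)   # worst-case p-value per module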
