example-data: Example data

Description

Example gene coexpression networks inferred from two independent datasets to demonstrate the usage of package functions.

Usage

data("NetRep")

Format

"discovery_network": a matrix with 150 columns and 150 rows containing the network edge weights encoding the interaction strength between each pair of genes in the discovery dataset.
"discovery_data": a matrix with 150 columns (genes) and 30 rows (samples) whose entries correspond to the expression level of each gene in each sample in the discovery dataset.
"discovery_correlation": a matrix with 150 columns and 150 rows containing the correlation coefficients between each pair of genes calculated from the "discovery_data" matrix.
"module_labels": a named vector with 150 entries containing the module assignment for each gene as identified in the discovery dataset.
"test_network": a matrix with 150 columns and 150 rows containing the network edge weights encoding the interaction strength between each pair of genes in the test dataset.
"test_data": a matrix with 150 columns (genes) and 30 rows (samples) whose entries correspond to the expression level of each gene in each sample in the test dataset.
"test_correlation": a matrix with 150 columns and 150 rows containing the correlation coefficients between each pair of genes calculated from the "test_data" matrix.

An object of class matrix (inherits from array) with 150 rows and 150 columns.

An object of class matrix (inherits from array) with 30 rows and 150 columns.

An object of class matrix (inherits from array) with 150 rows and 150 columns.

An object of class numeric of length 150.

An object of class matrix (inherits from array) with 150 rows and 150 columns.

An object of class matrix (inherits from array) with 30 rows and 150 columns.

An object of class matrix (inherits from array) with 150 rows and 150 columns.
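
As a quick sanity check, the shapes listed above can be compared against the loaded objects. This is a minimal sketch that assumes the NetRep package is installed; the `expected_dims` list and the loop are illustrative helpers, not part of the package.

```r
# Expected shapes, copied from the Format section (rows, columns).
expected_dims <- list(
  discovery_network     = c(150L, 150L),
  discovery_data        = c(30L, 150L),
  discovery_correlation = c(150L, 150L),
  test_network          = c(150L, 150L),
  test_data             = c(30L, 150L),
  test_correlation      = c(150L, 150L)
)

# Assumption: NetRep is installed. If it is not, the check is skipped.
if (requireNamespace("NetRep", quietly = TRUE)) {
  data("NetRep", package = "NetRep")
  for (name in names(expected_dims)) {
    stopifnot(identical(dim(get(name)), expected_dims[[name]]))
  }
  # module_labels is a vector with one entry per gene.
  stopifnot(length(module_labels) == 150)
}
```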

Simulation details

The discovery gene expression dataset ("discovery_data"), containing 30 samples and 150 genes, was simulated to contain four distinct modules of 20, 25, 30, and 35 genes. Data for each module were simulated as:

$$ G^{(w)}_{simulated} = E^{(w)} r_i + \sqrt{1 - r_i^2} \epsilon $$

where $E^{(w)}$ is the simulated module's summary vector, $r$ is the vector of the simulated module's node contributions (with $r_i$ the contribution of gene $i$), and $\epsilon$ is the error term drawn from a standard normal distribution. $E^{(w)}$ and $r$ were simulated by bootstrapping (sampling with replacement) samples and genes from the corresponding vectors in modules 63, 51, 57, and 50 discovered in the liver tissue gene expression data from a publicly available mouse dataset (see reference (1) for details on the dataset and network discovery).

The remaining 40 genes, which were not part of any module, were simulated by randomly selecting 40 liver genes, bootstrapping 30 samples, and adding the noise term $\epsilon$. A vector of module assignments ("module_labels") was created in which each gene was labelled with a number 1-4 corresponding to the module it was simulated to be coexpressed with, or with 0 for the 40 "background" genes not participating in any module.

The correlation structure ("discovery_correlation") was calculated as the Pearson correlation coefficient between genes (cor(discovery_data)). Edge weights in the interaction network ("discovery_network") were calculated as the absolute value of the correlation coefficient raised to the power of 5 (abs(discovery_correlation)^5).
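
The simulation equation and the derived correlation and network matrices can be sketched in base R as follows. This is an illustration under stated assumptions, not the authors' simulation script: the summary vector `E` and node contributions `r` are drawn from arbitrary random distributions here, whereas the original bootstrapped them from real liver modules, and the seed is arbitrary.

```r
set.seed(1)  # arbitrary seed for this sketch, not the seed used for the package data

n_samples <- 30
n_genes <- 20

# Stand-ins (assumptions): the original bootstrapped E and r from liver modules.
E <- as.numeric(scale(rnorm(n_samples)))   # module summary vector, one value per sample
r <- runif(n_genes, min = 0.5, max = 0.9)  # node contribution of each gene

# Error term: standard normal noise, one value per sample and gene.
epsilon <- matrix(rnorm(n_samples * n_genes), nrow = n_samples, ncol = n_genes)

# G[s, i] = E[s] * r[i] + sqrt(1 - r[i]^2) * epsilon[s, i]
G <- outer(E, r) + sweep(epsilon, 2, sqrt(1 - r^2), `*`)
colnames(G) <- paste0("gene_", seq_len(n_genes))

# Correlation structure and edge weights, computed as described in the text.
correlation <- cor(G)
network <- abs(correlation)^5
```

Raising the absolute correlation to the power of 5 keeps every edge weight between 0 and 1 while shrinking weak correlations towards 0 much more than strong ones.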

An independent test dataset ("test_data") containing the same 150 genes as the discovery dataset but 30 different samples was simulated as above. Modules 1 and 4 (containing 20 and 35 genes respectively) were simulated to be preserved using the same equation as above, where the summary vector $E^{(w)}$ was bootstrapped from the same liver modules (modules 63 and 50) as in the discovery dataset, with node contributions $r$ identical to those in the discovery dataset. Genes in modules 2 and 3 were simulated as "background" genes in the manner described above, i.e. these modules were not preserved. The correlation structure between genes in the test dataset ("test_correlation") and the interaction network ("test_network") were calculated in the same way as in the discovery dataset.
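
The difference between a preserved module and a non-preserved one can be illustrated with a small base R sketch. It rests on several assumptions: `simulate_module` and `mean_offdiag` are hypothetical helpers, `E` and `r` are random stand-ins rather than bootstrapped liver vectors, and the background genes are pure noise, which is simpler than the procedure described above.

```r
set.seed(2)  # arbitrary seed for this sketch

# Hypothetical helper: applies the simulation equation to one module.
simulate_module <- function(E, r) {
  epsilon <- matrix(rnorm(length(E) * length(r)), nrow = length(E))
  outer(E, r) + sweep(epsilon, 2, sqrt(1 - r^2), `*`)
}

n_samples <- 30
r <- runif(20, min = 0.5, max = 0.9)  # node contributions, reused in the test dataset

# Preserved module in the test dataset: new summary vector, identical r.
E_test <- as.numeric(scale(rnorm(n_samples)))
preserved_module <- simulate_module(E_test, r)

# Non-preserved module: its genes behave like background (noise only, an assumption).
background_genes <- matrix(rnorm(n_samples * 20), nrow = n_samples)

# Hypothetical helper: mean absolute pairwise correlation between genes.
mean_offdiag <- function(x) {
  correlation <- cor(x)
  mean(abs(correlation[upper.tri(correlation)]))
}

mean_preserved <- mean_offdiag(preserved_module)
mean_background <- mean_offdiag(background_genes)
```

In this sketch the genes of the preserved module remain coexpressed in the new samples because they share the summary vector through `r`, while the background genes show only chance-level correlation.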

The random seed used for the simulations was 37.

Details

The preservation of network modules in a second dataset is quantified by measuring the preservation of topological properties between the discovery and test datasets. These properties are calculated not only from the interaction networks inferred in each dataset, but also from the data used to infer those networks (e.g. gene expression data) as well as the correlation structure between variables/nodes. Thus, all functions in the NetRep package have the following arguments:

network: a list of interaction networks, one for each dataset.
data: a list of data matrices used to infer those networks, one for each dataset.
correlation: a list of matrices containing the pairwise correlation coefficients between variables/nodes in each dataset.
moduleAssignments: a list of vectors, one for each discovery dataset, containing the module assignments for each node in that dataset.
modules: a list of vectors, one vector for each discovery dataset, containing the names of the modules from that dataset to analyse.
discovery: a vector indicating the names or indices of the previous arguments' lists to use as the discovery dataset(s) for the analyses.
test: a list of vectors, one vector for each discovery dataset, containing the names or indices of the network, data, and correlation argument lists to use as the test dataset(s) for the analysis of each discovery dataset.

The example data are used to provide concrete examples of these arguments in each package function.
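
The arguments described above might be assembled from the example data along the following lines. This is a sketch that assumes the NetRep package is installed; the dataset names "cohort1" and "cohort2" are arbitrary labels, and the permutation count is kept small for speed rather than chosen for a real analysis.

```r
has_netrep <- requireNamespace("NetRep", quietly = TRUE)

if (has_netrep) {
  library(NetRep)
  data("NetRep")

  # One list element per dataset, with the same names in every list.
  network_list <- list(cohort1 = discovery_network, cohort2 = test_network)
  data_list <- list(cohort1 = discovery_data, cohort2 = test_data)
  correlation_list <- list(
    cohort1 = discovery_correlation,
    cohort2 = test_correlation
  )

  # With a single discovery dataset the label vector is passed directly here.
  preservation <- modulePreservation(
    network = network_list,
    data = data_list,
    correlation = correlation_list,
    moduleAssignments = module_labels,
    discovery = "cohort1",
    test = "cohort2",
    nPerm = 100,   # small value to keep the sketch fast
    nThreads = 1
  )
}
```

Naming the list elements lets the discovery and test arguments refer to datasets by name instead of by position.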

References

(1) Ritchie, S.C., et al. A scalable permutation approach reveals replication and preservation patterns of network modules in large datasets. Cell Systems 3, 71-82 (2016).
