NetGSA: Network-based Gene Set Analysis

Description

Tests the significance of pre-defined sets of genes (pathways) with respect to an outcome variable, such as a condition indicator (e.g. cancer vs. normal), based on the underlying biological networks.

Usage

NetGSA(A, x, group, pathways, lklMethod = "REHE",
       sampling = FALSE, sample_n = NULL, sample_p = NULL, minsize = 5,
       eta = 0.1, lim4kappa = 500)

Value

A list with components

results: A data frame with pathway names, pathway sizes, p-values and false discovery rate corrected q-values, and test statistic for all pathways.
beta: Vector of fixed effects of length $kp$; the first k elements correspond to condition 1, the second k to condition 2, etc.
s2.epsilon: Variance of the random errors $\epsilon$.
s2.gamma: Variance of the random effects $\gamma$.
graph: List of components needed in plot.NetGSA.

Arguments

A: A list of weighted adjacency matrices. Typically returned from prepareAdjMat
x: The $p \times n$ data matrix with rows referring to genes and columns to samples. It is very important that the adjacency matrices A share the same rownames as the data matrix x.
group: Vector of class indicators of length $n$.
pathways: The $npath \times p$ indicator matrix for pathways.
lklMethod: Method used for variance component calculation: options are ML (maximum likelihood), REML (restricted maximum likelihood), HE (Haseman-Elston regression) or REHE (restricted Haseman-Elston regression). See details.
sampling: (Logical) whether to subsample the observations and/or variables. See details.
sample_n: The ratio for subsampling the observations if sampling=TRUE.
sample_p: The ratio for subsampling the variables if sampling=TRUE.
minsize: Minimum number of genes in pathways to be considered.
eta: Approximation limit for the Influence matrix. See 'Details'.
lim4kappa: Limit for condition number (used to adjust eta). See 'Details'.
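
The shape requirements listed above can be checked before calling NetGSA. The following is a minimal sketch in base R, assuming one adjacency matrix per condition; all object names and values are hypothetical toy data, and NetGSA itself is not called.

```r
# Toy dimensions (hypothetical): p genes, n samples, 2 conditions, 2 pathways
p <- 6; n <- 8
genes <- paste0("gene", seq_len(p))

set.seed(1)
# p x n data matrix: genes in rows, samples in columns
x <- matrix(rnorm(p * n), nrow = p, dimnames = list(genes, NULL))

# Class indicators, one per sample (column of x)
group <- rep(c(1, 2), each = n / 2)

# One weighted adjacency matrix per condition (assumption),
# with row and column names matching rownames(x)
make_adj <- function() {
  a <- matrix(0, p, p, dimnames = list(genes, genes))
  a[2, 1] <- 0.4
  a[3, 2] <- 0.3
  a
}
A <- list(make_adj(), make_adj())

# npath x p indicator matrix: 1 when the gene belongs to the pathway
pathways <- rbind(pathA = c(1, 1, 1, 0, 0, 0),
                  pathB = c(0, 0, 0, 1, 1, 1))
colnames(pathways) <- genes

# Consistency checks mirroring the argument descriptions
inputs_ok <- all(
  ncol(x) == length(group),
  ncol(pathways) == nrow(x),
  length(A) == length(unique(group)),
  vapply(A, function(a) identical(rownames(a), rownames(x)), logical(1))
)
inputs_ok
```

Note that these toy pathways contain only 3 genes each, so they would fall below the default minsize of 5 in a real call.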

Author

Ali Shojaie and Jing Ma

Details

The function NetGSA carries out a Network-based Gene Set Analysis, using the method described in Shojaie and Michailidis (2009) and Shojaie and Michailidis (2010). It can be used for gene set (pathway) enrichment analysis where the data come from $K$ heterogeneous conditions, where $K = 1, 2$, or more. NetGSA differs from Gene Set Analysis (Efron and Tibshirani, 2007) in that it incorporates the underlying biological networks. Therefore, when the networks encoded in A are empty, one should instead consider alternative approaches such as Gene Set Analysis (Efron and Tibshirani, 2007).

The NetGSA method is formulated in terms of a mixed linear model. Let $X$ represent the rearrangement of data x into an $np \times 1$ column vector. The model is $$X = \Psi \beta + \Pi \gamma + \epsilon,$$ where $\beta$ is the vector of fixed effects, and $\gamma$ and $\epsilon$ are the random effects and random errors, respectively. The underlying biological networks are encoded in the weighted adjacency matrices, which determine the influence matrix under each condition. The influence matrices in turn determine the design matrices $\Psi$ and $\Pi$ in the mixed linear model. Formally, the influence matrix under each condition represents the effect of each gene on all the other genes in the network and is calculated from the adjacency matrix (A[[k]] for the $k$-th condition). A small value of eta is used to make sure that the influence matrices are well-conditioned (i.e. their condition numbers are bounded by lim4kappa).
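
To make the notion of an influence matrix concrete, here is a minimal sketch for a directed acyclic network. It assumes the influence matrix takes the form $(I - A)^{-1}$, following the directed formulation in Shojaie and Michailidis (2009); the toy network, the reading of entry [i, j] as the effect of gene j on gene i, and the comparison against 500 are illustrative assumptions, and the adjustment that NetGSA performs with eta is not reproduced here.

```r
# Hypothetical 3-gene chain gene1 -> gene2 -> gene3 with edge weight 0.5;
# entry [i, j] is read here as the effect of gene j on gene i (assumption)
A_k <- matrix(0, 3, 3)
A_k[2, 1] <- 0.5
A_k[3, 2] <- 0.5

# Assumed influence matrix for a directed acyclic graph:
# (I - A)^{-1} = I + A + A^2 + ..., accumulating effects along paths
influence <- solve(diag(3) - A_k)

# Indirect effect of gene1 on gene3: product of the weights along the path
influence[3, 1]

# Condition number, to be compared against a bound such as lim4kappa
cond_number <- kappa(influence, exact = TRUE)
cond_number < 500
```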

The problem is then to test the null hypothesis $\ell\beta = 0$ against the alternative $\ell\beta \neq 0$, where $\ell$ is a contrast vector, optimally defined through the underlying networks. For a one-sample or two-sample test, the test statistic $T$ for each gene set has approximately a t-distribution under the null, whose degrees of freedom are estimated using the Satterthwaite approximation method. When analyzing complex experiments involving multiple conditions, several contrast vectors are often of interest for a specific subnetwork. In that case, one can combine the contrast vectors into a contrast matrix $L$ and use a different test statistic, $F$. Under the null, $F$ has an F-distribution, whose degrees of freedom are calculated based on the contrast matrix $L$ as well as the variances of $\gamma$ and $\epsilon$. The fixed effects $\beta$ are estimated by generalized least squares, and the estimate depends on the estimated variance components of $\gamma$ and $\epsilon$.
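
The generalized least squares step and the resulting contrast statistic can be sketched as follows. This is a toy illustration using standard mixed-model formulas, not the package's internal code: the design matrices, the contrast, and the variance components (treated as known) are all hypothetical, and the Satterthwaite degrees of freedom are not computed.

```r
set.seed(42)
m <- 12                                   # length of the stacked data vector (toy)
Psi <- cbind(rep(c(1, 0), each = m / 2),  # hypothetical fixed-effects design:
             rep(c(0, 1), each = m / 2))  # one mean per condition
Pi <- diag(m)                             # hypothetical random-effects design
Pi[cbind(2:m, 1:(m - 1))] <- 0.3
beta_true  <- c(1, 3)
s2_gamma   <- 0.5                         # variance components, treated as known
s2_epsilon <- 1

# Marginal covariance implied by the mixed model
W <- s2_gamma * Pi %*% t(Pi) + s2_epsilon * diag(m)

# Simulate X = Psi beta + Pi gamma + epsilon
X <- Psi %*% beta_true + Pi %*% rnorm(m, sd = sqrt(s2_gamma)) +
  rnorm(m, sd = sqrt(s2_epsilon))

# Generalized least squares estimate of beta and its covariance
W_inv    <- solve(W)
cov_beta <- solve(t(Psi) %*% W_inv %*% Psi)
beta_hat <- cov_beta %*% t(Psi) %*% W_inv %*% X

# Contrast comparing condition 2 with condition 1
l <- matrix(c(-1, 1), nrow = 1)
t_stat <- as.numeric(l %*% beta_hat) / sqrt(as.numeric(l %*% cov_beta %*% t(l)))
t_stat
```

In practice the variance components are unknown and are replaced by their estimates, which is why the estimate of $\beta$ depends on them.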

The variance components ($\sigma^2_{\epsilon}$ and $\sigma^2_{\gamma}$) can be estimated in several ways after profiling out $\sigma^2_{\epsilon}$: ML/REML, which use Newton's method, or HE/REHE, which are based on the Haseman-Elston regression method. The latter exploits the fact that $Var(X)=\sigma^2_{\gamma}\Pi\Pi' + \sigma^2_{\epsilon}I$ and, after vectorizing both sides, uses ordinary least squares to solve for the unknown coefficients. REHE in particular uses nonnegative least squares for the regression and therefore ensures nonnegative estimates of the variance components. Because of this simple formulation, HE/REHE also allows subsampling with respect to both the samples and the variables, and is recommended especially when the problem is large (i.e. large $p$ and/or large $n$).
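
The Haseman-Elston idea can be sketched in a few lines of base R. The function he_fit below is a hypothetical helper, not part of the package: it regresses the vectorized covariance on the vectorized $\Pi\Pi'$ and identity matrices, and with nonneg = TRUE it solves the two-coefficient nonnegative least squares problem by checking the boundary solutions, as a stand-in for the constraint that REHE imposes.

```r
# S stands in for an empirical covariance of the data (hypothetical input)
he_fit <- function(S, Pi, nonneg = FALSE) {
  # Regress vec(S) on vec(Pi Pi') and vec(I)
  Z <- cbind(as.vector(Pi %*% t(Pi)), as.vector(diag(nrow(S))))
  y <- as.vector(S)
  est <- drop(solve(crossprod(Z), crossprod(Z, y)))   # ordinary least squares
  if (nonneg && any(est < 0)) {
    # With two coefficients, a constrained optimum lies on the boundary:
    # fit each column alone, clamp at zero, keep the better of the two fits
    cands <- lapply(1:2, function(j) {
      b <- c(0, 0)
      b[j] <- max(0, sum(Z[, j] * y) / sum(Z[, j]^2))
      b
    })
    rss <- vapply(cands, function(b) sum((y - Z %*% b)^2), numeric(1))
    est <- cands[[which.min(rss)]]
  }
  setNames(est, c("s2.gamma", "s2.epsilon"))
}

# Toy check: a covariance built from known components is recovered
Pi <- matrix(c(1, 0.5, 0,  0, 1, 0.5,  0, 0, 1), nrow = 3)
S  <- 2 * Pi %*% t(Pi) + 3 * diag(3)
he_fit(S, Pi)
```

A real analysis would build S from the data after removing the fixed effects; that step is omitted here.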

The pathway membership information is stored in pathways, which should be an $npath \times p$ matrix. See prepareAdjMat for details on how to prepare a suitable pathway membership object.
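
As an illustration of the expected layout, an indicator matrix of this shape can be built from a named list of gene identifiers. This is a base R sketch with hypothetical gene and pathway names; it does not use any helper from the package.

```r
# Hypothetical gene universe and pathway membership lists
genes <- c("g1", "g2", "g3", "g4")
pathway_list <- list(pathA = c("g1", "g2"),
                     pathB = c("g2", "g3", "g4"))

# npath x p indicator matrix: rows are pathways, columns are genes
membership <- t(vapply(pathway_list,
                       function(members) as.numeric(genes %in% members),
                       numeric(length(genes))))
colnames(membership) <- genes
membership

# Pathway sizes, to compare with minsize
rowSums(membership)
```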

This function can deal with both directed and undirected networks, which are specified via the option directed. Note that NetGSA uses slightly different procedures to calculate the influence matrices for directed and undirected networks. In either case, the user can still apply NetGSA if only partial information on the adjacency matrices is available. The functions netEst.undir and netEst.dir provide details on how to estimate the weighted adjacency matrices from data based on available network information.

References

Ma, J., Shojaie, A., & Michailidis, G. (2016). Network-based pathway enrichment analysis with incomplete network information. Bioinformatics, 32(20), 3165-3174. doi:10.1093/bioinformatics/btw410

Shojaie, A., & Michailidis, G. (2010). Network enrichment analysis in complex experiments. Statistical Applications in Genetics and Molecular Biology, 9(1), Article 22. https://pubmed.ncbi.nlm.nih.gov/20597848/

Shojaie, A., & Michailidis, G. (2009). Analysis of gene sets based on the underlying regulatory network. Journal of Computational Biology, 16(3), 407-426. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3131840/

Examples

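library(netgsa)

## load the example data; the objects used below (x, group, pathways,
## pathways_mat) are assumed to come from this data set
data("breastcancer2012_subset")

## consider genes from just 2 pathways
genenames <- unique(c(pathways[["Adipocytokine signaling pathway"]],
                      pathways[["Adrenergic signaling in cardiomyocytes"]]))
sx <- x[rownames(x) %in% genenames, ]

## look up edges among these genes in the selected databases
db_edges <- obtainEdgeList(rownames(sx), databases = c("kegg", "reactome"))

## prepare the weighted adjacency matrices from the data and the edge list
adj_cluster <- prepareAdjMat(sx, group, databases = db_edges, cluster = TRUE)

## run NetGSA on the first two pathways, restricted to the genes in sx
out_cluster <- NetGSA(adj_cluster[["Adj"]], sx, group,
                      pathways_mat[c(1, 2), rownames(sx)],
                      lklMethod = "REHE", sampling = FALSE)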