The FusionLearn package implements a new learning algorithm to integrate information from different experimental platforms. The algorithm applies the grouped penalization method in the pseudolikelihood setting (Gao and Carroll, 2017).
In the context of fusion learning, there are \(k\) data sets from \(k\) different experimental platforms. The data from each platform can be modeled by a different generalized linear model. Assume that the same set of predictors \(\{M_1, M_2, \dots, M_j, \dots, M_p\}\) is measured across all \(k\) experimental platforms.
| Platforms | Formula | \(M_1\) | \(M_2\) | \(\dots\) | \(M_j\) | \(\dots\) | \(M_p\) |
|---|---|---|---|---|---|---|---|
| 1 | \(y_1: g_1(\mu_1) \sim\) | \(x_{11}\beta_{11}+\) | \(x_{12}\beta_{12}+\) | \(\dots\) | \(x_{1j}\beta_{1j}+\) | \(\dots\) | \(x_{1p}\beta_{1p}\) |
| 2 | \(y_2: g_2(\mu_2) \sim\) | \(x_{21}\beta_{21}+\) | \(x_{22}\beta_{22}+\) | \(\dots\) | \(x_{2j}\beta_{2j}+\) | \(\dots\) | \(x_{2p}\beta_{2p}\) |
| \(\dots\) | \(\dots\) | \(\dots\) | \(\dots\) | \(\dots\) | \(\dots\) | \(\dots\) | \(\dots\) |
| \(k\) | \(y_k: g_k(\mu_k) \sim\) | \(x_{k1}\beta_{k1}+\) | \(x_{k2}\beta_{k2}+\) | \(\dots\) | \(x_{kj}\beta_{kj}+\) | \(\dots\) | \(x_{kp}\beta_{kp}\) |
Here \(x_{kj}\) represents the observation of the predictor \(M_j\) on the \(k\)th platform, and \(\beta^{(j)}\) denotes the vector of regression coefficients for the predictor \(M_j\).
| Platforms | \(M_j\) | \(\beta^{(j)}\) |
|---|---|---|
| 1 | \(x_{1j}\) | \(\beta_{1j}\) |
| 2 | \(x_{2j}\) | \(\beta_{2j}\) |
| \(\dots\) | \(\dots\) | \(\dots\) |
| \(k\) | \(x_{kj}\) | \(\beta_{kj}\) |
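To make this layout concrete, the sketch below shows one way to hold such multi-platform data in R: a list of design matrices and a list of responses, one per platform, with column \(j\) of every design matrix corresponding to the same predictor \(M_j\). The object names (`x`, `y`, `beta`) and the Gaussian simulation are illustrative assumptions, not package requirements.

```r
## Illustrative multi-platform layout (assumed structure, not package code):
## x[[i]] holds the observations x_{i1}, ..., x_{ip} on platform i, and
## column j of beta is the coefficient group beta^{(j)} for predictor M_j.
set.seed(1)
k <- 3    # platforms
n <- 50   # observations per platform
p <- 10   # predictors M_1, ..., M_p

x <- lapply(1:k, function(i)
  matrix(rnorm(n * p), n, p, dimnames = list(NULL, paste0("M", 1:p))))

beta <- matrix(0, nrow = k, ncol = p)   # row i = platform i, column j = beta^{(j)}
beta[, 1:3] <- rnorm(3 * k)             # only M_1, M_2, M_3 are active, on every platform

## Gaussian responses with identity links, one vector per platform
y <- lapply(1:k, function(i) as.vector(x[[i]] %*% beta[i, ] + rnorm(n)))
```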
Consider the following examples.
Example 1. Suppose \(k\) different types of experiments are conducted to study the genetic mechanism of a disease. The predictors in this research are different facets of individual genes, such as mRNA expression, protein expression, and RNA-seq expression. The goal is to select the genes that affect the disease, while each gene is assessed in a number of ways through different measurement processes across the \(k\) experimental platforms.
Example 2. The predictive models for three different financial indices are simultaneously built from a panel of stock index predictors. In this case, the predictor values across different models are the same, but the regression coefficients are different.
In the conventional approach, the model for each of the \(k\) platforms is analyzed separately. The FusionLearn algorithm instead selects significant predictors by learning from the multiple models jointly. The overall objective is to maximize the function:
$$Q(\beta)=l_I(\beta)- n \sum_{j=1}^{p} \Omega_{\lambda_n}\left(||\beta^{(j)}||\right),$$
with \(p\) being the number of predictors, \(\Omega_{\lambda_n}\) being the penalty function with tuning value \(\lambda_n\), and \(||\beta^{(j)}|| = (\sum_{i=1}^{k}\beta_{ij}^2)^{1/2}\) denoting the \(L_2\)-norm of the coefficients of the predictor \(M_j\) across the \(k\) platforms.
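As a concrete illustration of this criterion, the sketch below evaluates \(Q(\beta)\) for Gaussian working models with a simple group-lasso penalty \(\Omega_{\lambda_n}(t) = \lambda_n t\), reusing the simulated `x`, `y`, and `beta` from the earlier sketch. The helper names `loglik_gauss` and `Q_obj` are hypothetical; the package carries out this optimization internally and supports other penalty choices.

```r
## Evaluate Q(beta) = l_I(beta) - n * sum_j Omega_lambda(||beta^(j)||) for Gaussian
## working models and Omega_lambda(t) = lambda * t (group-lasso type).
## A sketch of the criterion only, not the package's optimizer.
loglik_gauss <- function(x, y, beta) {
  ## pseudo-loglikelihood l_I(beta): sum of platform-wise Gaussian loglikelihoods
  ## (unit variance, additive constants dropped)
  sum(vapply(seq_along(x), function(i) {
    r <- y[[i]] - as.vector(x[[i]] %*% beta[i, ])
    -0.5 * sum(r^2)
  }, numeric(1)))
}

Q_obj <- function(x, y, beta, lambda) {
  n <- nrow(x[[1]])
  group_norms <- sqrt(colSums(beta^2))   # ||beta^(j)|| across the k platforms
  loglik_gauss(x, y, beta) - n * lambda * sum(group_norms)
}

Q_obj(x, y, beta, lambda = 0.1)   # larger is better; sparsity is induced group-wise
```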
The user can specify the penalty function \(\Omega_{\lambda_n}\) and the penalty values \(\lambda_n\). This package also provides functions to compute the pseudolikelihood Bayesian information criterion:
$$\text{pseu-BIC}(s) = -2\, l_I(\hat{\beta}_I; Y) + d_s^{*} \gamma_n,$$
with \(l_I(\hat{\beta}_I; Y)\) denoting the pseudo-loglikelihood evaluated at the estimate \(\hat{\beta}_I\), \(d_s^{*}\) measuring the complexity of the candidate model \(s\), and \(\gamma_n\) being the penalty on the model complexity.
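The criterion is a direct plug-in once the fitted pseudo-loglikelihood, the complexity measure \(d_s^{*}\), and \(\gamma_n\) are available; for instance, \(\gamma_n = \log n\) gives a BIC-type penalty. The helper below (`pseu_bic`) and the toy inputs are a hypothetical illustration, not the package's internal routine.

```r
## pseu-BIC(s) = -2 * l_I(beta_hat; Y) + d_star * gamma_n
## Hypothetical helper; the toy numbers only illustrate how two candidate
## models would be compared (smaller values are preferred).
pseu_bic <- function(loglik, d_star, gamma_n) {
  -2 * loglik + d_star * gamma_n
}

pseu_bic(loglik = -210.4, d_star = 9,  gamma_n = log(150))
pseu_bic(loglik = -205.1, d_star = 15, gamma_n = log(150))
```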
The basic function `fusionbase` deals with continuous responses. The function `fusionbinary` is applied to binary responses, and the function `fusionmixed` is applied to a mix of continuous and binary responses.
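A schematic workflow is sketched below. The argument lists shown for `fusionbase`, `fusionbinary`, and `fusionmixed` are assumptions and the calls are left commented out; consult `?fusionbase`, `?fusionbinary`, and `?fusionmixed` for the actual interfaces and defaults.

```r
## Schematic use of the three fitting functions; argument names in the commented
## calls are assumptions -- check ?fusionbase, ?fusionbinary, ?fusionmixed.
# install.packages("FusionLearn")
library(FusionLearn)

## x: list of per-platform design matrices, y: list of per-platform responses
## (for example, the simulated objects from the first sketch above)

## All responses continuous:
# fit <- fusionbase(x, y, lambda = 0.1, ...)

## All responses binary:
# fit <- fusionbinary(x, y, lambda = 0.1, ...)

## A mix of continuous and binary responses:
# fit <- fusionmixed(x, y, lambda = 0.1, ...)
```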
Gao, X. and Carroll, R. J. (2017). Data integration with high dimensionality. *Biometrika*, 104(2), 251-272.