FuncompCGL: Fit regularization paths of sparse log-contrast regression with functional compositional predictors.

Description

Fit the penalized log-contrast regression with functional compositional predictors proposed by Zhe et al. (2020) <arXiv:1808.02403>. The model estimation is conducted by minimizing a linearly constrained group lasso criterion. The regularization paths are computed for the group lasso penalty at grid values of the regularization parameter lam and the degree of freedom of the basis function K.

Usage

FuncompCGL(y, X, Zc = NULL, intercept = TRUE, ref = NULL,
           k, degree = 3, basis_fun = c("bs", "OBasis", "fourier"),
           insert = c("FALSE", "X", "basis"), method = c("trapezoidal", "step"),
           interval = c("Original", "Standard"), Trange,
           T.name = "TIME", ID.name = "Subject_ID",
           W = rep(1,times = p - length(ref)),
           dfmax = p - length(ref), pfmax = min(dfmax * 1.5, p - length(ref)),
           lam = NULL, nlam = 100, lambda.factor = ifelse(n < p1, 0.05, 0.001),
           tol = 1e-8, mu_ratio = 1.01,
           outer_maxiter = 1e+6, outer_eps = 1e-8,
           inner_maxiter = 1e+4, inner_eps = 1e-8)

Arguments

response vector with length n.

data frame or matrix.

If nrow(X) > $n$, X should be a data frame or matrix of the functional compositional predictors with $p$ columns for the values of the composition components, a column indicating subject ID and a column of observation times. Order of Subject ID should be the SAME as that of y. Zero entry is not allowed.
If nrow(X)[1]=$n$, X is considered as after taken integration, a n*(k*p - length(ref)) matrix.

a $n\times p_{c}$ design matrix of unpenalized variables. Default is NULL.

intercept

Boolean, specifying whether to include an intercept. Default is TRUE.

ref

reference level (baseline), either an integer between $[1,p]$ or NULL. Default value is NULL.

If ref is set to be an integer between [1,p], the group lasso penalized log-contrast model (with log-ratios) is fitted with the ref-th component chosed as baseline.
If ref is set to be NULL, the linearly constrained group lasso penalized log-contrast model is fitted.

an integer, degrees of freedom of the basis function.

degree

degrees of freedom of the basis function. Default value is 3.

basis_fun

method to generate basis:

"bs"(Default) B-splines. See fucntion bs.
"OBasis" Orthoganal B-splines. See function OBasis and package orthogonalsplinebasis.
"fourier" Fourier basis. See fucntion create.fourier.basis and package fda.

insert

a character string sepcifying method to perform functional interpolation.

"FALSE"(Default) no interpolation.
"X" linear interpolation of functional compositional data along the time grid.
"basis" the functional compositional data is interplolated as a step function along the time grid.

If insert = "X" or "basis", interplolation is conducted on sseq, where sseq is the sorted sequence of all the observed time points.

method

a character string sepcifying method used to approximate integral.

"trapezoidal"(Default) Sum up areas under the trapezoids.
"step" Sum up area under the rectangles.

interval

a character string sepcifying the domain of the integral.

"Original"(Default) On the original time scale, interval = range(Time).
"Standard" Time points are mapped onto $[0,1]$, interval = c(0,1).

Trange

range of time points

T.name, ID.name

a character string specifying names of the time variable and the Subject ID variable in X. This is only needed when X is a data frame or matrix of the functional compositional predictors. Default are "TIME" and "Subject_ID".

a vector of length p (the total number of groups), or a matrix with dimension $p_{1} \times p_{1}$, where p1=(p - length(ref)) * k, or character specifying the function used to calculate weight matrix for each group.

a vector of penalization weights for the groups of coefficients. A zero weight implies no shrinkage.
a diagonal matrix with positive diagonal elements.
if character string of function name or an object of type function to compute the weights.

dfmax

limit the maximum number of groups in the model. Useful for handling very large $p$, if a partial path is desired. Default is $p$.

pfmax

limit the maximum number of groups ever to be nonzero. For example once a group enters the model along the path, no matter how many times it re-enters the model through the path, it will be counted only once. Default is min(dfmax*1.5, p).

lam

a user supplied lambda sequence. If lam is provided as a scaler and nlam$>1$, lam sequence is created starting from lam. To run a single value of lam, set nlam$=1$. The program will sort user-defined lambda sequence in decreasing order.

nlam

the length of the lam sequence. Default is 100. No effect if lam is provided.

lambda.factor

the factor for getting the minimal lambda in lam sequence, where min(lam) = lambda.factor * max(lam). max(lam) is the smallest value of lam for which all penalized group are 0's. If $n >= p1$, the default is 0.001. If $n < p1$, the default is 0.05.

tol

tolerance for coefficient to be considered as non-zero. Once the convergence criterion is satisfied, for each element $\beta_j$ in coefficient vector $\beta$, $\beta_j = 0$ if $\beta_j < tol$.

mu_ratio

the increasing ratio of the penalty parameter u. Default value is 1.01. Inital values for scaled Lagrange multipliers are set as 0's. If mu_ratio < 1, the program automatically set the initial penalty parameter u as 0 and outer_maxiter as 1, indicating that there is no linear constraint.

outer_maxiter, outer_eps

outer_maxiter is the maximum number of loops allowed for the augmented Lanrange method; and outer_eps is the corresponding convergence tolerance.

inner_maxiter, inner_eps

inner_maxiter is the maximum number of loops allowed for blockwise-GMD; and inner_eps is the corresponding convergence tolerance.

Value

An object with S3 class "FuncompCGL", which is a list containing:

the integral matrix for the functional compositional predictors with dimension $n \times (pk)$.

lam

the sequence of lam values.

the number of non-zero groups in the estimated coefficients for the functional compositional predictors at each value of lam.

beta

a matrix of coefficients with length(lam) columns and $p_{1}+p_{c}+1$ rows, where p_1=p*k. The first $p_{1}$ rows are the estimated values for the coefficients for the functional compositional preditors, and the last row is for the intercept. If intercept = FALSE, the last row is 0's.

dim

dimension of the coefficient matrix.

sseq

sequence of the time points.

call

the call that produces this object.

Details

The functional log-contrast regression model for compositional predictors is defined as $$ y = 1_n\beta_0 + Z_c\beta_c + \int_T Z(t)\beta(t)dt + e, s.t. (1_p)^T \beta(t)=0 \forall t \in T, $$ where $\beta_0$ is the intercept, $\beta_c$ is the regression coefficient vector with length $p_c$ corresponding to the control variables, $\beta(t)$ is the functional regression coefficient vector with length $p$ as a funtion of $t$ and $e$ is the random error vector with zero mean with length $n$. Moreover, Z(t) is the log-transformed functional compostional data. If zero(s) exists in the original functional compositional data, user should pre-process these zero(s). For example, if count data provided, user could replace 0's with 0.5.

After adopting a truncated basis expansion approach to re-express $\beta(t)$ $$\beta(t) = B \Phi(t),$$ where $B$ is a p-by-k unkown but fixed coefficient matrix, and $\Phi(t)$ consists of basis with degree of freedom $k$. We could write functional log-contrast regression model as $$ y = 1_n\beta_0 + Z_c\beta_c + Z\beta + e, s.t. \sum_{j=1}^{p}\beta_j=0_k, $$ where $Z$ is a n-by-pk matrix corresponding to the integral, $\beta=vec(B^T)$ is a pk-vector with every each k-subvector corresponding to the coefficient vector for the $j$-th compositional component.

To enable variable selection, FuncompCGL model is estimated via linearly constrained group lasso, $$ argmin_{\beta_0, \beta_c, \beta}(\frac{1}{2n}\|y - 1_n\beta_0 - Z_c\beta_c - Z\beta\|_2^2 + \lambda \sum_{j=1}^{p} \|\beta_j\|_2), s.t. \sum_{j=1}^{p} \beta_j = 0_k. $$

References

Sun, Z., Xu, W., Cong, X., Li G. and Chen K. (2020) Log-contrast regression with functional compositional predictors: linking preterm infant's gut microbiome trajectories to neurobehavioral outcome, https://arxiv.org/abs/1808.02403 Annals of Applied Statistics.

Yang, Y. and Zou, H. (2015) A fast unified algorithm for computing group-lasso penalized learning problems, https://link.springer.com/article/10.1007/s11222-014-9498-5 Statistics and Computing 25(6) 1129-1141.

Aitchison, J. and Bacon-Shone, J. (1984) Log-contrast models for experiments with mixtures, Biometrika 71 323-330.

Examples

Run this code

# NOT RUN {
df_beta = 5
p = 30
beta_C_true = matrix(0, nrow = p, ncol = df_beta)
beta_C_true[1, ] <- c(-0.5, -0.5, -0.5 , -1, -1)
beta_C_true[2, ] <- c(0.8, 0.8,  0.7,  0.6,  0.6)
beta_C_true[3, ] <- c(-0.8, -0.8 , 0.4 , 1 , 1)
beta_C_true[4, ] <- c(0.5, 0.5, -0.6  ,-0.6, -0.6)
Data <- Fcomp_Model(n = 50, p = p, m = 0, intercept = TRUE,
                    SNR = 4, sigma = 3, rho_X = 0, rho_T = 0.6, df_beta = df_beta,
                    n_T = 20, obs_spar = 1, theta.add = FALSE,
                    beta_C = as.vector(t(beta_C_true)))
m1 <- FuncompCGL(y = Data$data$y, X = Data$data$Comp, Zc = Data$data$Zc,
                 intercept = Data$data$intercept, k = df_beta, tol = 1e-10)
print(m1)
plot(m1)
beta <- coef(m1)
arg_list <- as.list(Data$call)[-1]
arg_list$n <- 30
TEST <- do.call(Fcomp_Model, arg_list)
y_hat <- predict(m1, Znew = TEST$data$Comp, Zcnew = TEST$data$Zc)
plot(y_hat[, floor(length(m1$lam)/2)], TEST$data$y,
     ylab = "Observed Response", xlab = "Predicted Response")

beta <- coef(m1, s = m1$lam[20])
beta_C <- matrix(beta[1:(p*df_beta)], nrow = p, byrow = TRUE)
colSums(beta_C)
Non.zero <- (1:p)[apply(beta_C, 1, function(x) max(abs(x)) > 0)]
Non.zero

# }

Run the code above in your browser using DataLab