core: Covariance Reduction

Description

Method to reduce sample covariance matrices to an informational core that is sufficient to characterize the variance heterogeneity among different populations.

Usage

core(X, y, Sigmas = NULL, ns = NULL, numdir = 2,
        numdir.test = FALSE, ...)

Arguments

X

Data matrix with n rows of observations and p columns of predictors. The predictors are assumed to have a continuous distribution.

y

Vector of group labels. Observations with the same label are considered to be in the same group.

Sigmas

A list object of sample covariance matrices corresponding to the different populations.

ns

A vector of the numbers of observations in the samples corresponding to the different populations.

numdir

Integer between 1 and p. It is the number of directions to estimate for the reduction.

numdir.test

Boolean. If FALSE, core computes the reduction for the specific number of directions numdir. If TRUE, it does the computation of the reduction for the numdir directions, from 0 to numd

...

Other arguments to pass to GrassmannOptim.
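The Sigmas and ns arguments describe the same information as X and y, in summarized form. As a minimal sketch in base R, the following builds one covariance matrix and one sample size per group from raw data. The built-in iris data set is only a stand-in, and this page does not say whether core expects covariances computed with divisor n_g or n_g - 1, so the use of cov() here (divisor n_g - 1) is an assumption.

```r
# Illustrative only: build per-group covariance matrices and sample sizes
# from raw data, using the built-in iris data set as a stand-in.
X <- as.matrix(iris[, 1:4])   # n x p matrix of continuous predictors
y <- iris$Species             # group labels

# Row indices of the observations belonging to each group.
groups <- split(seq_len(nrow(X)), y)

# One covariance matrix per group; cov() uses divisor (n_g - 1),
# which is an assumption about what core expects.
Sigmas <- lapply(groups, function(idx) cov(X[idx, , drop = FALSE]))

# Number of observations per group, in the same order as Sigmas.
ns <- vapply(groups, length, integer(1))

str(Sigmas, max.level = 1)
ns
```

Keeping Sigmas and ns in the same group order matters, since each sample size is matched to a covariance matrix by position.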

Value

This command returns a list object of class ldr. The output depends on the argument numdir.test. If numdir.test=TRUE, a list of matrices is provided for each of the parameters $\Gamma$, $\Sigma$, and $\Sigma_g$, corresponding to the numdir values (1 through numdir). Otherwise, a single list of matrices is returned for the single value of numdir. A likelihood ratio test and information criteria are provided to estimate the dimension of the sufficient reduction when numdir.test=TRUE. The outputs loglik, aic, bic, and numpar are vectors with numdir elements if numdir.test=TRUE, and scalars otherwise. The following components are returned:

Gammahat

Estimate of $\Gamma$.

Sigmahat

Estimate of overall $\Sigma$.

Sigmashat

Estimate of group-specific $\Sigma_g$'s.

loglik

Maximized value of the CORE log-likelihood.

aic

Akaike information criterion value.

bic

Bayesian information criterion value.

numpar

Number of parameters in the model.

Details

Consider the problem of characterizing the covariance matrices $\Sigma_y, y=1,...,h$, of a random vector $X$ observed in each of $h$ normal populations. Let $S_y = (n_y-1)\tilde{\Sigma}_y$ where $\tilde{\Sigma}_y$ is the sample covariance matrix corresponding to $\Sigma_y$, and $n_y$ is the number of observations corresponding to $y$. The goal is to find a semi-orthogonal matrix $\Gamma \in R^{p \times d}, d < p$, with the property that for any two populations $j$ and $k$ $$S_j|(\Gamma' S_j \Gamma=B, n_j=m) \sim S_k|(\Gamma' S_k \Gamma=B, n_k=m).$$ That is, given $\Gamma' S_g \Gamma$ and $n_g$, the conditional distribution of $S_g$ must not depend on $g$. Thus $\Gamma' S_g \Gamma$ is sufficient to account for the heterogeneity among the population covariance matrices.

The central subspace $\mathcal{S}$, spanned by the columns of $\Gamma$, is obtained by maximizing the following log-likelihood function $$L(\mathcal{S})= c-\frac{n}{2} \log|\tilde{\Sigma}| + \frac{n}{2} \log|P_{\mathcal{S}} \tilde{\Sigma} P_{\mathcal{S}}|-\sum_{y=1}^{h}\frac{n_y}{2} \log|P_{\mathcal{S}} \tilde{\Sigma}_y P_{\mathcal{S}}|,$$ where $c$ is a constant depending only on $p$ and $n_y$, $n=\sum_{y=1}^{h} n_y$ is the total number of observations, $\tilde{\Sigma}_y, y=1,...,h,$ denotes the sample covariance matrix from population $y$ computed with divisor $n_y$, and $\tilde{\Sigma}=\sum_{y=1}^{h} (n_y/n)\tilde{\Sigma}_y$. The optimization is carried out over $\mathcal{G}_{(d,p)}$, the set of all $d$-dimensional subspaces in $R^{p}$, called the Grassmann manifold, which has dimension $d(p-d)$.

The dimension $d$ is to be estimated. A sequential likelihood ratio test and information criteria (AIC, BIC) are implemented, following Cook and Forzani (2008).
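To make the objective concrete, here is a minimal base R sketch that evaluates the log-likelihood at a given basis $\Gamma$, dropping the constant $c$. It is not the package's implementation. The function name core_objective and the use of iris are illustrative assumptions. Because $P_{\mathcal{S}} A P_{\mathcal{S}}$ has rank at most $d < p$, its ordinary determinant is zero, so the sketch assumes $|P_{\mathcal{S}} A P_{\mathcal{S}}|$ is meant as the product of the nonzero eigenvalues, which equals $|\Gamma' A \Gamma|$ when $\Gamma$ is semi-orthogonal.

```r
# Sketch of the CORE objective, up to the additive constant c.
# Assumption: log|P S P| is read as the log-determinant of Gamma' S Gamma,
# where Gamma is a semi-orthogonal basis of the subspace.
core_objective <- function(Gamma, Sigmas, ns) {
  n <- sum(ns)
  # Pooled covariance: weighted average of the group covariances.
  Sigma <- Reduce(`+`, Map(function(S, m) (m / n) * S, Sigmas, ns))
  logdet <- function(A) as.numeric(determinant(A, logarithm = TRUE)$modulus)
  proj <- function(S) crossprod(Gamma, S %*% Gamma)  # Gamma' S Gamma
  -(n / 2) * logdet(Sigma) + (n / 2) * logdet(proj(Sigma)) -
    sum(mapply(function(S, m) (m / 2) * logdet(proj(S)), Sigmas, ns))
}

# Example data: iris, with group covariances computed using divisor n_y.
X <- as.matrix(iris[, 1:4])
groups <- split(seq_len(nrow(X)), iris$Species)
ns <- vapply(groups, length, integer(1))
Sigmas <- lapply(groups, function(idx) {
  Xg <- X[idx, , drop = FALSE]
  crossprod(sweep(Xg, 2, colMeans(Xg))) / length(idx)
})

# A random 2-dimensional subspace of R^4, via QR orthonormalization.
set.seed(1)
Gamma <- qr.Q(qr(matrix(rnorm(4 * 2), 4, 2)))
core_objective(Gamma, Sigmas, ns)
```

The value depends on $\Gamma$ only through the subspace it spans: replacing Gamma by Gamma %*% O for any orthogonal $d \times d$ matrix O leaves the result unchanged, which is why the search can be carried out over subspaces rather than over bases.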

References

Cook RD and Forzani L (2008). Covariance reducing models: An alternative to spectral modelling of covariance matrices. Biometrika, Vol. 95, No. 4, 799--812.

Examples


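library(ldr)

# Fit from raw data: the first column of flea holds the group labels.
data(flea)
fit1 <- core(X=flea[,-1], y=flea[,1], numdir.test=TRUE)
summary(fit1)

# Fit from precomputed covariance matrices and sample sizes,
# calling ldr() with model = "core".
data(snakes)
fit2 <- ldr(Sigmas=snakes[-3], ns=snakes[[3]], numdir = 4,
            model = "core", numdir.test = TRUE, verbose=TRUE,
            sim_anneal = TRUE, max_iter = 200, max_iter_sa=200)
summary(fit2)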
