fit.full.GMCM: Estimate GMCM parameters of the general model

Description

Estimates the parameters of general Gaussian mixture copula models (GMCM). The function fits a general GMCM using one of several optimization procedures. Note that all methods except "PEM" provide the maximum likelihood estimate.

Usage

fit.full.GMCM(u, m, theta = choose.theta(u, m), method = c("NM",
  "SANN", "L-BFGS", "L-BFGS-B", "PEM"), max.ite = 1000, verbose = TRUE,
  ...)
fit.general.GMCM(u, m, theta = choose.theta(u, m), method = c("NM",
  "SANN", "L-BFGS", "L-BFGS-B", "PEM"), max.ite = 1000, verbose = TRUE,
  ...)

Arguments

u

An n by d matrix of marginally uniform observations. Rows correspond to observations and columns to the dimensions of the variables. These are often ranked and scaled test statistics or other observations.
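As an illustration of what "ranked and scaled" can mean here, the following base R sketch turns arbitrary test statistics into values strictly inside (0, 1). The matrix x and the scaling by n + 1 are assumptions made for this example; the package's own Uhat function, used in the Examples section, serves this purpose and may differ in details such as the handling of ties.

```r
# Hypothetical raw data: two columns with different marginal distributions.
set.seed(1)
x <- cbind(rnorm(200), rexp(200))
n <- nrow(x)

# Rank each column separately and scale by n + 1 so that no value
# equals exactly 0 or 1.
u <- apply(x, 2, rank) / (n + 1)

range(u)  # all values lie strictly between 0 and 1
```

The resulting matrix has the same shape as x and can be supplied as the u argument.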

m

The number of components to be fitted.

theta

A list of parameters as defined in rtheta. If theta is not provided, then heuristic starting values are chosen using the k-means algorithm.

method

A character vector of length 1. The optimization method used. Should be either "NM", "SANN", "L-BFGS", "L-BFGS-B", or "PEM", which are the Nelder-Mead, simulated annealing, limited-memory quasi-Newton method, limited-memory quasi-Newton method with box constraints, and the pseudo EM algorithm, respectively. Default is "NM". See optim for further details.

max.ite

The maximum number of iterations. If the method is "SANN", this is the exact number of iterations, as there is no other stopping criterion. See optim.

verbose

Logical. If TRUE, a trace of the parameter estimates is made.

…

Arguments passed to the control list in optim when method is not equal to "PEM". If method equals "PEM", the arguments are passed to PseudoEMAlgorithm.

Value

A list of parameters formatted as described in rtheta.

When method equals "PEM", a list of extra information (log-likelihood trace, the matrix of group probabilities, theta trace) is added as an attribute called "extra".
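A minimal sketch of retrieving this attribute is given below. The simulation sizes and the iteration cap are arbitrary choices for illustration, and the names of the elements inside the list are not assumed; str is used only to inspect whatever is returned.

```r
library(GMCM)

set.seed(2)
sim  <- SimulateGMCMData(n = 300, m = 2, d = 2)  # small illustrative data set
uhat <- Uhat(sim$u)

# Fit with the pseudo EM algorithm; only this method attaches "extra".
res.pem <- fit.full.GMCM(u = uhat, m = 2, method = "PEM",
                         max.ite = 50, verbose = FALSE)

extra <- attr(res.pem, "extra")
str(extra, max.level = 1)  # inspect the available components
```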

Details

The "L-BFGS-B" method does not perform a transformation of the parameters and uses box constraints as implemented in optim. Note that many parameter configurations are poorly estimable or outright unidentifiable.

fit.general.GMCM is simply an alias of fit.full.GMCM.

References

Li, Q., Brown, J. B., Huang, H., & Bickel, P. J. (2011). Measuring reproducibility of high-throughput experiments. The Annals of Applied Statistics, 5(3), 1752-1779. doi:10.1214/11-AOAS466

Tewari, A., Giering, M. J., & Raghunathan, A. (2011). Parametric Characterization of Multimodal Distributions with Non-Gaussian Modes. 2011 IEEE 11th International Conference on Data Mining Workshops, 286-292. doi:10.1109/ICDMW.2011.135

Examples

# NOT RUN {
set.seed(17)
sim <- SimulateGMCMData(n = 1000, m = 3, d = 2)

# Plotting simulated data
par(mfrow = c(1,2))
plot(sim$z, col = rainbow(3)[sim$K], main = "Latent process")
plot(sim$u, col = rainbow(3)[sim$K], main = "GMCM process")

# Observed data
uhat <- Uhat(sim$u)

# The model should be fitted multiple times using different starting estimates
start.theta <- choose.theta(uhat, m = 3)  # Random starting estimate
res <- fit.full.GMCM(u = uhat, theta = start.theta,
                     method = "NM", max.ite = 3000,
                     reltol = 1e-2, trace = TRUE)  # Note, 1e-2 is too big

# Confusion matrix
Khat <- apply(get.prob(uhat, theta = res), 1, which.max)
table("Khat" = Khat, "K" = sim$K)  # Note, some components have been swapped

# Simulation from GMCM with the fitted parameters
simfit <- SimulateGMCMData(n = 1000, theta = res)

# As seen, the underlying latent process is hard to estimate.
# The clustering, however, is very good.
par(mfrow = c(2,2))
plot(simfit$z, col = simfit$K, main = "Model check 1\nSimulated GMM")
plot(simfit$u, col = simfit$K, main = "Model check 2\nSimulated GMCM")
plot(sim$u, col = Khat, main = "MAP clustering")
# }
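The example notes that the model should be fitted multiple times using different starting estimates. One possible way to do this is sketched below. The number of restarts (n.starts) is a hypothetical choice, and the sketch assumes that dgmcm.loglik(theta, u) from the same package returns the total log-likelihood of a fit, so the best of several runs can be selected.

```r
library(GMCM)

set.seed(17)
sim  <- SimulateGMCMData(n = 500, m = 3, d = 2)
uhat <- Uhat(sim$u)

n.starts <- 3  # hypothetical number of restarts; more is usually safer

# Each call to choose.theta draws a new starting estimate.
fits <- lapply(seq_len(n.starts), function(i) {
  fit.full.GMCM(u = uhat, m = 3, theta = choose.theta(uhat, m = 3),
                method = "NM", max.ite = 2000, verbose = FALSE)
})

# Assumption: dgmcm.loglik gives the log-likelihood of theta given u.
logliks <- sapply(fits, function(th) dgmcm.loglik(theta = th, u = uhat))

best <- fits[[which.max(logliks)]]  # keep the fit with the highest value
```

Comparing the log-likelihoods across restarts also gives a rough indication of how sensitive the fit is to the starting values.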
