ncMcTest: Perform a Monte Carlo test for the number of components in a mixture of regressions.

Description

Produces nsim simulated realizations of the likelihood ratio statistic, either parametrically or semi-parametrically, and calculates the corresponding \(p\)-value of the test.

Usage

ncMcTest(x, y, data=NULL, ncomp=2, ncincr=1, intercept=TRUE, nsim=99,
         seed=NULL, ts1=NULL, ts2=NULL, semiPar=FALSE,
         conditional=semiPar, verb=FALSE, progRep=TRUE, …)

Arguments

Either a formula specifying the regression model to be fitted or a predictor or matrix of predictors for the regression model. If x is a matrix it should NOT include an initial column of 1s. If an intercept is desired then the intercept argument should be left equal to TRUE.

The vector of responses for the regression models. Ignored if x is a formula.

data

A list or data frame containing the variables in the regression model. Such variables will be looked for first in data and then in the global environment.

ncomp

The null-hypothesized number of components in the mixture.

ncincr

The increment from the null-hypothesized number of components in the mixture to the number under the alternative hypothesis; i.e. the number of components under the alternative hypothesis is ncomp + ncincr.

intercept

Logical argument indicating whether the regression models in the mixture should have intercept terms. Ignored if x is a formula.

nsim

The number of simulated replicates of the log likelihood ratio statistic to be produced.

seed

Positive integer scalar. The seed for random number generation. If left NULL it is obtained by sampling from 1:1e5.

ts1

Starting values for fitting the ncomp component model. If ts1 is null, random starting values are used. (This is not recommended.)

ts2

Starting values for fitting the ncomp+nincr component model. If ts2 is null, random starting values are used. (This is not recommended.)

semiPar

Logical scalar; should semi-parametric bootstrapping should be used?

conditional

Logical scalar; should the component-selection probabilities be determined conditionally upon the observations?

verb

Logical argument indicating whether the fitting processes should be verbose (i.e. whether details should be printed out at each step of the EM algorithm). If TRUE a huge amount of screen output is produced.

progRep

Logical argument indicating whether the index, of the simulated statistic just constructed, should be printed out, to give an idea of how the process is progressing.

...

Further arguments to be passed to mixreg() to control the fitting procedure.

Value

A list with components:

lrs

the likelihood ratio statistic for the test

pval

the (Monte Carlo) \(p\)-value of the test

simStats

a vector of the values of the likelihood ratio statistics of the simulated data sets

aic.ncomp

a vector of the aic values for the ncomp models fitted to the simulated data sets

aic.ncomp+nincr

a vector of the aic values for the ncomp+nincr models fitted to the simulated data sets. (Note that in any given instance ncomp and nincr are replaced by the actual numeric values that are used in that instance.)

the degrees of freedom that would be appropriate if the test statistic actually had a chi-squared distribution. Explicitly df is equal to the number of parameters in the alternative model minus the number of parameters in the null model

screwUps

a data frame with columns seed, i and type, containing respectively the random number generator seed, the index and the type of each screw-up. Note that if there were in fact no screw-ups then the screwUps entry of the returned value will not be present. Note also that it is possible for values in the i column of screwUps to be repeated. The entries of the type column take values in the set {1,2,3,4,5}. These values have the interpretation:

1: singularity in the likelihood surface for the ncomp model
2: the algorithm failed to converge when fitting the ncomp component model
3: singularity in the likelihood surface for the ncomp+nincr model
4: the algorithm converged for the ncomp component model but failed to converge for the ncomp + nincr model
5: the likelihood ratio statistic for the ncomp model was greater than that for the ncomp+nincr model (which is theoretically impossible)

Note that if a screw-up does occur, the replicate is redone completely so that the returned value contains results for a full nsim simulations.

The returned value has an attribute "seed" which is the (initial) value of the random number generation seed that was used. This is either the value of the argument seed, or, if this was NULL, a randomly generated value.

Details

In this context the “parametric” procedure is to simulate data sets by generating data from the fitted ncomp model parameters, using Gaussian errors. In contrast, under the semiparametric bootstrapping procedure, the errors are generated by resampling from the residuals. Since at each predictor value there are ncomp residuals, one for each component of the model, the errors are selected at random from these ncomp possibilities. If the argument conditional is TRUE then the selection probabilities at this step are the conditional probabilities, of the observation being generated by each component of the model, given that observation. If conditional is FALSE then these probabilities are the corresponding entries of lambda (see Value. The residuals are sampled independently in either case. The procedure is termed semi-parametric since the sampling probabilities depend on the parameters of the model. Note that it makes no sense to specify conditional=TRUE if semiPar is FALSE. Doing so will generate an error.

It is important to be aware that the test conducted by this function is a Monte Carlo test and that the \(p\)-value produced is a Monte Carlo \(p\)-value. It is consequently an exact \(p\)-value in a sense which must be carefully understood. See for example Baddeley et al. 2015 (section 10.6) and Turner and Jeffs 2017 for explanation of the interpretation of Monte Carlo \(p\)-values and for some general discussion of Monte Carlo tests and of their advantages. Such tests effect substantial savings on computational costs with only marginal diminishment of power.

References

T. Rolf Turner (2000). Estimating the rate of spread of a viral infection of potato plants via mixtures of regressions. Applied Statistics 49 Part 3, pp. 371 -- 384.

Adrian Baddeley, Ege Rubak and Rolf Turner (2015). Spatial Point Patterns: Methodology and Applications with R. London: Chapman and Hall/CRC Press.

Rolf Turner and Celeste Jeffs (2017). A note on exact Monte Carlo hypothesis tests. Communications in Statistics: Simulation and Computation 46, pp. 6545 -- 6558.

Examples

Run this code

# NOT RUN {
    
# }
# NOT RUN {
       tst12 <- ncMcTest(plntsInf ~ aphRel,ncomp=1,data=aphids,seed=42)
    
# }
# NOT RUN {
 # Monte Carlo p-value is 0.01; mixture model is called for.
    TS1 <- list(list(beta=c(3.0,0.1),sigsq=16,lambda=0.5),
                list(beta=c(0.0,0.0),sigsq=16,lambda=0.5))
    TS2 <- list(list(beta=c(3.0,0.1),sigsq=9,lambda=1/3),
                list(beta=c(1.5,0.05),sigsq=9,lambda=1/3),
                list(beta=c(0.0,0.0),sigsq=9,lambda=1/3))
    x <- aphids$aphRel
    y <- aphids$plntsInf
    
# }
# NOT RUN {
      nsim <- 999
    
# }
# NOT RUN {
    
# }
# NOT RUN {
    tst23 <- ncMcTest(x,y,nsim=nsim,ts1=TS1,ts2=TS2)
# }

Run the code above in your browser using DataLab