Learn R Programming

mixreg (version 2.0-10)

mixreg: Fit a mixture of linear regressions.

Description

Estimates the parameters for a mixture of linear regressions, assuming Gaussian errors, using the EM algorithm.

Usage

mixreg(x, y, ncomp = NULL, intercept = TRUE, eqVar = FALSE,
       thetaStart = NULL, itmax = 1000,eps = 1e-06, verb = TRUE,
       digits = 7, maxTry = 5, seed = NULL, data = NULL,
       covMat=FALSE,MC=FALSE,warn=TRUE,...)

Arguments

x

Either a formula specifying the regression model in question or a matrix of predictors for that regression model. In the latter case x should NOT include an initial column of 1s. (If an intercept is required, leave intercept as TRUE!) If there is only one predictor, x may be a vector rather than a one-column matrix.

y

The vector of responses for the regression models. Not used and should not be specified if x is a formula.

ncomp

The number of components in the mixture. If thetaStart (see below) is supplied, then the number of components is determined from this argument and ncomp is ignored. If neither ncomp nor thetaStart is supplied then ncomp defaults to 2.

intercept

Logical scalar; should the linear regressions should have intercepts fitted? Ignored if x is a formula.

eqVar

Logical scalar; should the error variance should be the same for all component? (The alternative is that each component should be allowed a different error variance.)

thetaStart

Either a list or a matrix providing starting values for the estimation procedure. If it is a list each of its entries is in turn a list with entries beta (vector of linear coefficients), sigsq (variance) and lambda (mixing probability). If it is a matrix it must have ncomp rows and ncoef + 2 columns, where ncoef is the number of regression coefficients in the model. The latter quantity is ncol(x) if intercept is FALSE and is 1 + ncol(x) if intercept is TRUE.

If eqVar is TRUE then it is sensible to have all the starting values of sigsq equal, but this is not strictly necessary. If thetaStart is not specified, starting values are produced “randomly”. This sometimes works, but sometimes doesn't. The function visualFit() provides a convenient means of determining starting values by a “visual” procedure. Note that if object is the value returned by visualFit() then the appropriate value for thetaStart is object$theta.

itmax

The maximum number of EM steps to be undertaken. If this maximum number of steps is exceeded then a warning is issued, but a fit (probably unreliable) is returned anyway. The returned value will, in this case, have the converged component equal to FALSE.

eps

A value specifying the convergence criterion for the EM algorithm. If the maximum absolute value of the change in the parameters is less than eps the algorithm is considered to have converged.

verb

Logical argument; if verb is TRUE then details of the progress of the algorithm are printed out at each EM step.

digits

The number of digits to which the details are printed out, when verb is TRUE.`

maxTry

If the algorithm encounters a singularity in the likelihood (as may possibly occur when eqVar is FALSE) the algorithm is restarted using new (randomly generated) starting values. The restart is attempted a maximum of maxTry times.

seed

A numeric scalar used to seed the random number generators if thetaStart is not specified (whence starting values are produced randomly). If seed is not an integer it gets rounded to the nearest integer (so seed=pi has the same impact as seed=3. If not supplied, seed is selected randomly from 1, 2, …, 1e5. The seed for random number generation is set to seed in initRand() before any random number generation is done. The argument seed is ignored if thetaStart is supplied.

data

A list or data frame in which the values of the data x and y may be stored. If data is not NULL then mixreg() looks for x and y first in data and then in the global environment.

covMat

Logical scalar; should the covariance matrix of the parameter estimates be calculated? (This can take some time.)

MC

Logical scalar; should the covariance matrix of the parameter estimates be calculated by a Monte Carlo procedure? Ignored if covMat is FALSE.

warn

Logical scalar. Should a warning be issued if the EM algorithm fails to converge in itmax iterations? Note that this failure is discernible anyway, from the converged component of the returned value, but it's probably best to be alerted to the failure. This argument is present mainly so that it can be set to FALSE internally in ncMcTest(), where such warnings are redundant.

...

Optional arguments which are passed to covMixMC() (if covMat and MC are both TRUE. These optional arguments may be semiPar, conditional or cMseed. (Any other argument being given to mixreg() will trigger an error.)

Value

A list, of class mixreg, with components

parmat

The parameters of the fitted model arranged as a matrix, each row corresponding to one component of the mixture.

theta

The parameters of the fitted model as a list, each entry of the list being itself a list (like those in thetaStart) corresponding to one component of the mixture.

log.like

The log likelihood of the fitted model, based on Gaussian errors.

aic

The Akaike Information Criterion value for the fitted model; aic is equal to -2 * log.like + 2*M where M is the number of parameters in the model.

intercept

The value of the intercept argument.

eqVar

The value of the eqVar argument.

nsteps

The number of steps the EM algorithm took to converge.

converged

Logical scalar indicating whether the algorithm did indeed `converge or stopped because it reach the itmax EM step.

data

A list or data frame providing the data x and y to which the model was fitted. It may be equal to the value of the data argument, or it may have been constructed, in whole or in part, from the x and y arguments.

formula

The formula used by lm() in fitting the regression models. Depending on circumstances it may be equal to the argument fmla, or it may have been constructed internally by mixreg(), in which case it is of the form y ~ x or, if intercept is FALSE, y ~ x - 1 (with, of course, x and y replaced by the actual names of the objects provided as the corresponding arguments).

The returned value has an attribute "seed" specifying the seed for the random number generators that was used in producing random starting values. If thetaStart was specified then the attribute "seed" is NA.

Warning

If the x argument is a formula, then specifying y is not only meaningless, but will cause a (possibly mystifying) error to be thrown.

Notes

This function (and indeed the entire mixreg package) is structured so as to be able to deal with mixtures of regressions involving any number of predictors. However I know of no practical examples in which more than one predictor is involved, and it seems likely that models involving more than one predictor would present substantial difficulties. Some of the functions in this package (cband(), visualFit(), plot.mixreg(), …) can cope only with single-predictor models.

The entries parmat and theta of the returned value contain the same information, presented in a different format.

Even if eqVar is TRUE, each entry of theta still has its own sigsq entry. The values of these will all be equal, however, if eqVar is TRUE.

For scalar mixtures, allowing the components to have different variances can have the effect of introducing singularities in the likelihood function. It is not clear to me what the impact of allowing different variances is in the mixtures-of-regressions setting. In respect of the scalar setting, Aitkin and Tunnicliffe Wilson (see References) assert that the singularities that may arise “do not cause any computational difficulty in practice”. However in the mixtures of regressions setting, I have observed strange anomalies which appear to be alleviated by setting eqVar=TRUE. See Examples. If results seem to be unsatisfactory, you may be well-advised to try setting eqVar=TRUE, to see if that makes an improvement.

References

T. Rolf Turner (2000). Estimating the rate of spread of a viral infection of potato plants via mixtures of regressions. Applied Statistics 49 Part 3, pp. 371 -- 384.

A. P. Dempster, N. M. Laird and D. B. Rubin (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society series B, 39, pp. 1 -- 22.

M. Aitkin and G. Tunnicliffe Wilson (1980). Mixture models, outliers, and the EM algorithm. Technometrics 22, pp. 325 -- 331.

See Also

ncMcTest(), cband(), qqMix(), residuals.mixreg()

Examples

Run this code
# NOT RUN {
# Aphids.
    fit1   <- mixreg(aphRel,plntsInf,ncomp=2,seed=42,data=aphids)
    plot(fit1)
    thStrt <- list(list(beta=c(3.0,0.1),sigsq=16,lambda=0.5),
                   list(beta=c(0.0,0.0),sigsq=16,lambda=0.5))
    fit2   <- mixreg(aphRel,plntsInf,ncomp=2,thetaStart=thStrt,data=aphids)
    fit3   <- mixreg(plntsInf ~ aphRel,ncomp=2,thetaStart=thStrt,data=aphids)
# Kilns.
# }
# NOT RUN {
    vfit <- visualFit(y ~ x,ncomp=3, data=kilnAoneOut)
    fit  <- mixreg(y ~ x,ncomp=3,data=kilnAoneOut,thetaStart=vfit$theta)
# }
# NOT RUN {
    thStrt <- list(
                   list(beta=c(26.07,48808),sigsq=1.1573,lambda=0.33333333),
                   list(beta=c(23.48,32387),sigsq=1.8730,lambda=0.33333333),
                   list(beta=c(-0.0597,20760),sigsq=0.2478,lambda=0.33333333)
                 )
    # Roughly the value of vfit$theta.
    fit  <- mixreg(y ~ x,ncomp=3,data=kilnAoneOut,thetaStart=thStrt)
    plot(fit)
# Kilns, zero intercept model.
# }
# NOT RUN {
    vfit <- visualFit(y ~ x - 1,ncomp=3, data=kilnAoneOut)
    fit  <- mixreg(y ~ x - 1,ncomp=3,data=kilnAoneOut,thetaStart=vfit$theta)
# }
# NOT RUN {
    thStrt <- list(list(beta=50900,sigsq=3.297,lambda=0.33333333),
                  list(beta=33800,sigsq=25.52,lambda=0.33333333),
                  list(beta=20755,sigsq=0.2477,lambda=0.33333333))
    # Roughly the value of vfit$theta.
    fit  <- mixreg(y ~ x - 1,ncomp=3,data=kilnAoneOut,thetaStart=thStrt)
    plot(fit) # Yikes!!!  (But a plot of vfit looks practically perfect.)
    fit  <- mixreg(y ~ x - 1,ncomp=3,data=kilnAoneOut,thetaStart=thStrt,eqVar=TRUE)
    plot(fit) # Looks fine.
# }

Run the code above in your browser using DataLab