robustHD (version 0.8.1)

diagnosticPlot: Diagnostic plots for a sequence of regression models

Description

Produce diagnostic plots for a sequence of regression models, such as submodels along a robust least angle regression sequence, or sparse least trimmed squares regression models for a grid of values for the penalty parameter. Four plots are currently implemented.

Usage

diagnosticPlot(object, ...)

# S3 method for seqModel
diagnosticPlot(object, s = NA, covArgs = list(), ...)

# S3 method for perrySeqModel
diagnosticPlot(object, covArgs = list(), ...)

# S3 method for tslars
diagnosticPlot(object, p, s = NA, covArgs = list(), ...)

# S3 method for sparseLTS
diagnosticPlot(
  object,
  s = NA,
  fit = c("reweighted", "raw", "both"),
  covArgs = list(),
  ...
)

# S3 method for perrySparseLTS
diagnosticPlot(
  object,
  fit = c("reweighted", "raw", "both"),
  covArgs = list(),
  ...
)

# S3 method for setupDiagnosticPlot
diagnosticPlot(
  object,
  which = c("all", "rqq", "rindex", "rfit", "rdiag"),
  ask = (which == "all"),
  facets = object$facets,
  size = c(2, 4),
  id.n = NULL,
  ...
)

Value

If only one plot is requested, an object of class "ggplot" (see ggplot), otherwise a list of such objects.

Arguments

object

the model fit for which to produce diagnostic plots, or an object containing all necessary information for plotting (as generated by setupDiagnosticPlot).

...

additional arguments to be passed down, eventually to geom_point.

s

for the "seqModel" method, an integer vector giving the steps of the submodels for which to produce diagnostic plots (the default is to use the optimal submodel). For the "sparseLTS" method, an integer vector giving the indices of the models for which to produce diagnostic plots (the default is to use the optimal model for each of the requested fits).

covArgs

a list of arguments to be passed to covMcd for the regression diagnostic plot (see “Details”).

p

an integer giving the lag length for which to produce the plot (the default is to use the optimal lag length).

fit

a character string specifying for which fit to produce diagnostic plots. Possible values are "reweighted" (the default) for diagnostic plots for the reweighted fit, "raw" for diagnostic plots for the raw fit, or "both" for diagnostic plots for both fits.

which

a character string indicating which plot to show. Possible values are "all" (the default) for all of the following, "rqq" for a normal Q-Q plot of the standardized residuals, "rindex" for a plot of the standardized residuals versus their index, "rfit" for a plot of the standardized residuals versus the fitted values, or "rdiag" for a regression diagnostic plot (standardized residuals versus robust Mahalanobis distances of the predictor variables).

ask

a logical indicating whether the user should be asked before each plot (see devAskNewPage). The default is to ask if all plots are requested and not ask otherwise.

facets

a faceting formula to override the default behavior. If supplied, facet_wrap or facet_grid is called depending on whether the formula is one-sided or two-sided.

size

a numeric vector of length two giving the point and label size, respectively.

id.n

an integer giving the number of the most extreme observations to be identified by a label. The default is to use the number of identified outliers, which can be different for the different plots. See “Details” for more information.
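
As a minimal sketch of how these arguments combine, the following fits a sparse LTS model and requests one plot at a time. The simulated data and all object names (x, y, fit, pRfit, setup, pRqq) are assumptions made for illustration. The sketch also assumes that which and id.n supplied to the "sparseLTS" method are passed down through ... to the "setupDiagnosticPlot" method, as the usage above suggests.

```r
library("robustHD")

## simulated data (hypothetical, for illustration only)
set.seed(1234)
n <- 50
p <- 10
x <- matrix(rnorm(n * p), n, p)
y <- rowSums(x[, 1:3]) + 0.5 * rnorm(n)
y[1:5] <- y[1:5] + 10  # a few vertical outliers

## fit a sparse LTS model for a single value of the penalty parameter
fit <- sparseLTS(x, y, lambda = 0.05, mode = "fraction")

## one plot only: standardized residuals versus fitted values,
## labeling the 3 most extreme observations
pRfit <- diagnosticPlot(fit, which = "rfit", id.n = 3)

## two-step route: extract the plotting information first, then plot
setup <- setupDiagnosticPlot(fit)
pRqq <- diagnosticPlot(setup, which = "rqq", size = c(2, 4))
```

Because a single plot is requested in each call, each result should be a single "ggplot" object that can be printed or modified further.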

Author

Andreas Alfons

Details

In the normal Q-Q plot of the standardized residuals, a reference line is drawn through the first and third quartiles. The id.n observations with the largest distances from that line are identified by a label (the observation number). The default for id.n is the number of regression outliers, i.e., the number of observations whose absolute residuals are too large (cf. weights).
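
The construction of the Q-Q reference line can be sketched in base R. This illustrates the idea and is not the package's code: the residuals are simulated, the names r, idN and labeled are hypothetical, and measuring the distance from the line vertically is an assumption.

```r
set.seed(1)
r <- c(rnorm(45), rnorm(5, mean = 6))  # hypothetical standardized residuals
n <- length(r)

## theoretical normal quantiles matched to the sorted residuals
ord <- order(r)
theo <- qnorm(ppoints(n))
samp <- r[ord]

## line through the first and third quartiles
qSamp <- quantile(r, c(0.25, 0.75), names = FALSE)
qTheo <- qnorm(c(0.25, 0.75))
slope <- diff(qSamp) / diff(qTheo)
intercept <- qSamp[1] - slope * qTheo[1]

## vertical distance of each point from the line (an assumption);
## the idN points farthest from the line would receive a label
idN <- 5
distLine <- abs(samp - (intercept + slope * theo))
labeled <- ord[order(distLine, decreasing = TRUE)[seq_len(idN)]]
```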

In the plots of the standardized residuals versus their index or the fitted values, horizontal reference lines are drawn at 0 and +/-2.5. The id.n observations with the largest absolute values of the standardized residuals are identified by a label (the observation number). The default for id.n is the number of regression outliers, i.e., the number of observations whose absolute residuals are too large (cf. weights).
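
A rough base R sketch of the labeling rule for these two plots follows. The residuals are simulated and the names are hypothetical. Using the 2.5 reference lines as the outlier rule is an assumption made here for simplicity; the package takes the outlier flags from the model's weights.

```r
set.seed(1)
r <- c(rnorm(45), rnorm(5, mean = 6))  # hypothetical standardized residuals

## observations outside the +/-2.5 reference lines
outlier <- abs(r) > 2.5

## mimic the default: label as many points as there are flagged outliers
idN <- sum(outlier)

## indices of the idN largest absolute residuals, i.e. the labeled points
labeled <- order(abs(r), decreasing = TRUE)[seq_len(idN)]
```

Under this simplified rule the labeled points coincide with the flagged outliers; a user-supplied id.n would label more or fewer points regardless of the flags.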

For the regression diagnostic plot, the robust Mahalanobis distances of the predictor variables are computed via the minimum covariance determinant (MCD) estimator based on only those predictors with non-zero coefficients (see covMcd). Horizontal reference lines are drawn at +/-2.5 and a vertical reference line is drawn at the upper 97.5% quantile of the \(\chi^{2}\) distribution with \(p\) degrees of freedom, where \(p\) denotes the number of predictors with non-zero coefficients. The id.n observations with the largest absolute values of the standardized residuals and/or largest robust Mahalanobis distances are identified by a label (the observation number). The default for id.n is the number of all outliers: regression outliers (i.e., observations whose absolute residuals are too large, cf. weights) and leverage points (i.e., observations with robust Mahalanobis distance larger than the 97.5% quantile of the \(\chi^{2}\) distribution with \(p\) degrees of freedom).
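
The leverage rule can be sketched with covMcd from robustbase and the base R function mahalanobis, which returns squared distances. The data and names are hypothetical. Whether the plot shows distances or squared distances is not stated here, so the sketch compares squared distances to the quantile, which is equivalent to comparing distances to its square root.

```r
library("robustbase")

set.seed(1234)
n <- 100
x <- matrix(rnorm(n * 3), n, 3)  # hypothetical predictors with non-zero coefficients
x[1:10, ] <- x[1:10, ] + 5       # shifted observations acting as leverage points
p <- ncol(x)

## robust center and scatter from the MCD estimator
mcd <- covMcd(x)

## squared robust Mahalanobis distances
rd2 <- mahalanobis(x, center = mcd$center, cov = mcd$cov)

## flag leverage points using the 97.5% chi-squared quantile with p df
cutoff <- qchisq(0.975, df = p)
leverage <- rd2 > cutoff
```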

Note that the argument alpha for controlling the subset size behaves differently for sparseLTS than for covMcd. For sparseLTS, the subset size \(h\) is determined by the fraction alpha of the number of observations \(n\). For covMcd, on the other hand, the subset size also depends on the number of variables \(p\) (see h.alpha.n). However, the "sparseLTS" and "perrySparseLTS" methods attempt to compute the MCD using the same subset size that is used to compute the sparse least trimmed squares regressions. This may not be possible if the number of selected variables is large compared to the number of observations. In such cases, setupDiagnosticPlot returns NAs for the robust Mahalanobis distances, and the regression diagnostic plot fails.
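
The difference between the two subset sizes can be sketched as follows, using h.alpha.n from robustbase. The values of n, p and alpha are illustrative, and the use of ceiling() for the fraction-based size is an assumption, since the exact rounding is not stated here.

```r
library("robustbase")

n <- 100       # number of observations
p <- 25        # number of variables entering the MCD
alpha <- 0.75  # illustrative value

## subset size as a plain fraction of n
## (ceiling() is an assumption about the rounding)
hFraction <- ceiling(alpha * n)

## subset size covMcd would use for the same alpha
hMcd <- h.alpha.n(alpha, n, p)

c(fraction = hFraction, covMcd = hMcd)
```

With many variables relative to the number of observations, the two sizes differ, which is why the methods try to pass the regression's subset size on to the MCD.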

See Also

ggplot, rlars, grplars, rgrplars, tslarsP, rtslarsP, tslars, rtslars, sparseLTS, plot.lts

Examples

## generate data
# example is not high-dimensional to keep computation time low
library("robustHD")  # provides rlars(), sparseLTS() and diagnosticPlot()
library("mvtnorm")   # provides rmvnorm()
set.seed(1234)  # for reproducibility
n <- 100  # number of observations
p <- 25   # number of variables
beta <- rep.int(c(1, 0), c(5, p-5))  # coefficients
sigma <- 0.5      # controls signal-to-noise ratio
epsilon <- 0.1    # contamination level
Sigma <- 0.5^t(sapply(1:p, function(i, j) abs(i-j), 1:p))
x <- rmvnorm(n, sigma=Sigma)    # predictor matrix
e <- rnorm(n)                   # error terms
i <- 1:ceiling(epsilon*n)       # observations to be contaminated
e[i] <- e[i] + 5                # vertical outliers
y <- c(x %*% beta + sigma * e)  # response
x[i,] <- x[i,] + 5              # bad leverage points


## robust LARS
# fit model
fitRlars <- rlars(x, y, sMax = 10)
# create plot
diagnosticPlot(fitRlars)


## sparse LTS
# fit model
fitSparseLTS <- sparseLTS(x, y, lambda = 0.05, mode = "fraction")
# create plot
diagnosticPlot(fitSparseLTS)
diagnosticPlot(fitSparseLTS, fit = "both")