LogisticDx (version 0.3)

dx: Diagnostics for binomial regression

Description

Returns diagnostic measures for a binary regression model, by covariate pattern.

Usage

dx(x, ...)

# S3 method for glm
dx(x, ..., byCov = TRUE)

Arguments

x

A regression model with class glm and x$family$family == "binomial".

...

Additional arguments, which are passed to ?stats::model.matrix, e.g. contrasts.arg, which can be used for factor coding.

byCov

If TRUE (the default), return values by covariate pattern rather than by individual observation.
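
A minimal sketch of the difference, assuming LogisticDx is installed (the model below is illustrative, not from this package's data):

library(LogisticDx)
## a binomial model with a single 3-level factor gives 3 covariate patterns
g1 <- glm(am ~ factor(cyl), family = binomial, data = mtcars)
nrow(dx(g1))                  ## one row per covariate pattern (3 here)
nrow(dx(g1, byCov = FALSE))   ## one row per observation (32 here)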

Value

A data.table, with rows sorted by \(\Delta \hat{\beta}_i\). If byCov = TRUE, there is one row per covariate pattern with at least one observation. The initial columns give the predictor variables \(1 \ldots p\).

Subsequent columns are labelled as follows:

\(\mathrm{y} \quad y_i\)

The actual number of observations with \(y=1\) in the model data.

\(\mathrm{P} \quad P_i\)

The probability of \(y=1\) for this covariate pattern. This is given by the inverse of the link function, x$family$linkinv. See ?stats::family
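
For an individual-level model this is simply the fitted value; a quick check (the model is illustrative):

g1  <- glm(am ~ wt + hp, family = binomial, data = mtcars)
eta <- predict(g1, type = "link")   ## linear predictor f(x)
P   <- g1$family$linkinv(eta)       ## inverse link gives P_i
all.equal(P, fitted(g1))            ## TRUE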

\(\mathrm{n} \quad n_i\)

Number of observations with these covariates. If byCov=FALSE then this will be \(=1\) for all observations.

\(\mathrm{yhat} \quad \hat{y}\)

The predicted number of observations having a response of \(y=1\), according to the model. This is: $$\hat{y_i} = n_i P_i$$
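
A sketch of how \(y_i\), \(n_i\) and \(\hat{y}_i\) arise by covariate pattern, using base R aggregation (the model and column names are illustrative; dx computes these internally):

g1 <- glm(am ~ factor(cyl), family = binomial, data = mtcars)
d  <- transform(mtcars, P = fitted(g1))
## P is constant within a pattern, so summing it over the pattern
## gives yhat_i = n_i * P_i directly
cp <- aggregate(cbind(y = am, n = 1, yhat = P) ~ cyl, data = d, FUN = sum)
cp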

\(\mathrm{h} \quad h_i\)

Leverage, the diagonal of the hat matrix used to generate the model: $$H = \sqrt{V} X (X^T V X)^{-1} X^T \sqrt{V}$$ Here \(^{-1}\) is the matrix inverse and \(^T\) is the matrix transpose. \(X\) is the matrix of predictors, given by stats::model.matrix. \(V\) is an \(N \times N\) diagonal matrix: all elements are \(=0\) except the diagonal, which is: $$v_{ii} = n_i P_i (1 - P_i)$$ The term \((X^T V X)^{-1}\) is the estimated covariance matrix of \(\hat{\beta}\). Leverage is a measure of the influence of this covariate pattern on the model and is approximately $$h_i \approx x_i - \bar{x} \quad \mathrm{for} \quad 0.1 < P_i < 0.9$$ That is, leverage is approximately the distance of covariate pattern \(i\) from the mean \(\bar{x}\). For values of \(P_i\) which are large (\(>0.9\)) or small (\(<0.1\)) this relationship no longer holds.
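
A sketch of this calculation for an individual-level model (\(n_i = 1\)), checked against stats::hatvalues (the model is illustrative):

g1 <- glm(am ~ wt + hp, family = binomial, data = mtcars)
X  <- model.matrix(g1)
P  <- fitted(g1)
v  <- P * (1 - P)                  ## v_ii, with n_i = 1
Vh <- diag(sqrt(v))                ## sqrt(V)
H  <- Vh %*% X %*% solve(t(X) %*% (v * X)) %*% t(X) %*% Vh
all.equal(diag(H), unname(hatvalues(g1)))  ## TRUE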

\(\mathrm{Pr} \quad Pr_i\)

The Pearson residual, a measure of influence. This is: $$Pr_i = \frac{y_i - \mu_y}{\sigma_y}$$ where \(\mu_y\) and \(\sigma_y\) are the mean and standard deviation of a binomial distribution, and \(\sigma^2_y\) is its variance. $$E(y=1) = \mu_y = \hat{y} = nP \quad \mathrm{and} \quad \sigma_y = \sqrt{nP(1 - P)}$$ Thus: $$Pr_i = \frac{y_i - n_i P_i}{\sqrt{n_i P_i (1 - P_i)}}$$
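
A quick check of the final formula against stats::residuals for an individual-level model (\(n_i = 1\); the model is illustrative):

g1 <- glm(am ~ wt + hp, family = binomial, data = mtcars)
P  <- fitted(g1)
Pr <- (mtcars$am - P) / sqrt(P * (1 - P))        ## n_i = 1 per row
all.equal(Pr, residuals(g1, type = "pearson"))   ## TRUE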

\(\mathrm{dr} \quad dr_i\)

The deviance residual, a measure of influence: $$dr_i = \mathrm{sign}(y_i - \hat{y}_i) \sqrt{d_i}$$ \(d_i\) is the contribution of observation \(i\) to the model deviance. The \(\mathrm{sign}\) above is:

  • \(y_i > \hat{y}_i \quad \rightarrow \mathrm{sign}(i)=1\)

  • \(y_i = \hat{y}_i \quad \rightarrow \mathrm{sign}(i)=0\)

  • \(y_i < \hat{y}_i \quad \rightarrow \mathrm{sign}(i)=-1\)

In logistic regression this is: $$y_i = 1 \quad \rightarrow \quad dr_i = \sqrt{2 \left[ \log (1 + \exp(f(x))) - f(x) \right]}$$ $$y_i = 0 \quad \rightarrow \quad dr_i = -\sqrt{2 \log (1 + \exp(f(x)))}$$ where \(f(x)\) is the linear function of the predictors \(1 \ldots p\): $$f(x) = \hat{\beta}_0 + \hat{\beta}_1 x_{i1} + \ldots + \hat{\beta}_p x_{ip}$$ This is also: $$dr_i = \mathrm{sign}(y_i - \hat{y}_i) \sqrt{2 \left( y_i \log{\frac{y_i}{\hat{y}_i}} + (n_i - y_i) \log{\frac{n_i - y_i}{n_i(1 - P_i)}} \right)}$$ To avoid the problem of division by zero: $$y_i = 0 \quad \rightarrow \quad dr_i = -\sqrt{2 n_i \, |\log(1 - P_i)|}$$ Similarly, to avoid \(\log{\infty}\): $$y_i = n_i \quad \rightarrow \quad dr_i = \sqrt{2 n_i \, |\log(P_i)|}$$ The above equations are used when calculating \(dr_i\) by covariate group.
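
A check of the logistic-regression form against stats::residuals for an individual-level model (\(n_i = 1\); the model is illustrative):

g1 <- glm(am ~ wt + hp, family = binomial, data = mtcars)
f  <- predict(g1, type = "link")                 ## f(x)
dr <- ifelse(mtcars$am == 1,
              sqrt(2 * (log1p(exp(f)) - f)),     ## y_i = 1
             -sqrt(2 * log1p(exp(f))))           ## y_i = 0
all.equal(dr, unname(residuals(g1, type = "deviance")))  ## TRUE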

\(\mathrm{sPr} \quad sPr_i\)

The standardized Pearson residual. The residual is standardized by the leverage \(h_i\): $$sPr_i = \frac{Pr_i}{\sqrt{1 - h_i}}$$

\(\mathrm{sdr} \quad sdr_i\)

The standardized deviance residual. The residual is standardized by the leverage, as above: $$sdr_i = \frac{dr_i}{\sqrt{1 - h_i}}$$
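
Both standardized residuals can be reproduced with stats::rstandard; a sketch (the model is illustrative):

g1  <- glm(am ~ wt + hp, family = binomial, data = mtcars)
h   <- hatvalues(g1)
sPr <- residuals(g1, type = "pearson")  / sqrt(1 - h)
sdr <- residuals(g1, type = "deviance") / sqrt(1 - h)
all.equal(sPr, rstandard(g1, type = "pearson"))   ## TRUE
all.equal(sdr, rstandard(g1, type = "deviance"))  ## TRUE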

\(\mathrm{dChisq} \quad \Delta P\chi^2_i\)

The change in the Pearson chi-square statistic with observation \(i\) removed. Given by: $$\Delta P\chi^2_i = sPr_i^2 = \frac{Pr_i^2}{1 - h_i}$$ where \(sPr_i\) is the standardized Pearson residual, \(Pr_i\) is the Pearson residual and \(h_i\) is the leverage. \(\Delta P\chi^2_i\) should be \(<4\) if the observation has little influence on the model.

\(\mathrm{dDev} \quad \Delta D_i\)

The change in the deviance statistic \(D = \sum_{i=1}^n dr_i^2\) with observation \(i\) excluded. It is scaled by the leverage \(h_i\), as above: $$\Delta D_i = sdr_i^2 = \frac{dr_i^2}{1 - h_i}$$

\(\mathrm{dBhat} \quad \Delta \hat{\beta}_i\)

The change in \(\hat{\beta}\) with observation \(i\) excluded. This is scaled by the leverage as above: $$\Delta \hat{\beta}_i = \frac{sPr_i^2 h_i}{1 - h_i}$$ where \(sPr_i\) is the standardized Pearson residual. \(\Delta \hat{\beta}_i\) should be \(<1\) if the observation has little influence on the model coefficients.
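
The last three measures follow directly from \(sPr_i\), \(sdr_i\) and \(h_i\); a sketch applying the cut-offs noted above (the model is illustrative):

g1  <- glm(am ~ wt + hp, family = binomial, data = mtcars)
h   <- hatvalues(g1)
sPr <- rstandard(g1, type = "pearson")
sdr <- rstandard(g1, type = "deviance")
dChisq <- sPr^2                  ## change in Pearson chi-square
dDev   <- sdr^2                  ## change in deviance
dBhat  <- sPr^2 * h / (1 - h)    ## change in coefficient estimates
which(dChisq > 4 | dBhat > 1)    ## observations worth inspecting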

See Also

plot.glm

Examples

library(LogisticDx)
## H&L 2nd ed. Table 5.8. Page 182.
## Pattern nos. 31, 477, 468
data(uis)
uis <- within(uis, {
    NDRGFP1 <- 10 / (NDRGTX + 1)
    NDRGFP2 <- NDRGFP1 * log((NDRGTX + 1) / 10)
})
(d1 <- dx(g1 <- glm(DFREE ~ AGE + NDRGFP1 + NDRGFP2 + IVHX +
                    RACE + TREAT + SITE +
                    AGE:NDRGFP1 + RACE:SITE,
                    family=binomial, data=uis)))
d1[519:521, ]  ## the three most influential covariate patterns (nos. 31, 477, 468)
