LogisticDx (version 0.3)

dx: Diagnostics for binomial regression

Description

Returns diagnostic measures for a binary regression model, by covariate pattern.

Usage

dx(x, ...)

# S3 method for glm
dx(x, ..., byCov = TRUE)

Arguments

x

A regression model with class glm and x$family$family == "binomial".

...

Additional arguments, which are passed to ?stats::model.matrix, e.g. contrasts.arg, which can be used for factor coding.

byCov

If TRUE (the default), return values by covariate pattern rather than by individual observation.
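
A minimal sketch of the difference, assuming LogisticDx is installed (the model below is illustrative, not from this package's data):

library(LogisticDx)
## a binomial model with a single 3-level factor gives 3 covariate patterns
g1 <- glm(am ~ factor(cyl), family = binomial, data = mtcars)
nrow(dx(g1))                  ## one row per covariate pattern (3 here)
nrow(dx(g1, byCov = FALSE))   ## one row per observation (32 here)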

Value

A data.table, with rows sorted by \(\Delta \hat{\beta}_i\). If byCov = TRUE, there is one row per covariate pattern with at least one observation. The initial columns give the predictor variables \(1 \ldots p\).

Subsequent columns are labelled as follows:

\(\mathrm{y} \quad y_i\)

The actual number of observations with \(y=1\) in the model data.

\(\mathrm{P} \quad P_i\)

The probability of \(y=1\) for this covariate pattern. This is given by the inverse of the link function, x$family$linkinv. See ?stats::family
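
For an individual-level model this is simply the fitted value; a quick check (the model is illustrative):

g1  <- glm(am ~ wt + hp, family = binomial, data = mtcars)
eta <- predict(g1, type = "link")   ## linear predictor f(x)
P   <- g1$family$linkinv(eta)       ## inverse link gives P_i
all.equal(P, fitted(g1))            ## TRUE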

\(\mathrm{n} \quad n_i\)

Number of observations with these covariates. If byCov=FALSE then this will be \(=1\) for all observations.

\(\mathrm{yhat} \quad \hat{y}\)

The predicted number of observations having a response of \(y=1\), according to the model. This is: $$\hat{y_i} = n_i P_i$$
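
A sketch of how \(y_i\), \(n_i\) and \(\hat{y}_i\) arise by covariate pattern, using base R aggregation (the model and column names are illustrative; dx computes these internally):

g1 <- glm(am ~ factor(cyl), family = binomial, data = mtcars)
d  <- transform(mtcars, P = fitted(g1))
## P is constant within a pattern, so summing it over the pattern
## gives yhat_i = n_i * P_i directly
cp <- aggregate(cbind(y = am, n = 1, yhat = P) ~ cyl, data = d, FUN = sum)
cp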

\(\mathrm{h} \quad h_i\)

Leverage, the diagonal of the hat matrix used to generate the model: $$H = \sqrt{V} X (X^T V X)^{-1} X^T \sqrt{V}$$ Here \(^{-1}\) is the matrix inverse and \(^T\) is the matrix transpose. \(X\) is the matrix of predictors, given by stats::model.matrix. \(V\) is an \(N \times N\) diagonal matrix: all elements are \(=0\) except the diagonal, which is: $$v_{ii} = n_i P_i (1 - P_i)$$ The term \((X^T V X)^{-1}\) is the estimated covariance matrix of \(\hat{\beta}\). Leverage is a measure of the influence of this covariate pattern on the model and is approximately $$h_i \approx x_i - \bar{x} \quad \mathrm{for} \quad 0.1 < P_i < 0.9$$ That is, leverage is approximately the distance of covariate pattern \(i\) from the mean \(\bar{x}\). For values of \(P_i\) which are large (\(>0.9\)) or small (\(<0.1\)) this relationship no longer holds.
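
A sketch of this calculation for an individual-level model (\(n_i = 1\)), checked against stats::hatvalues (the model is illustrative):

g1 <- glm(am ~ wt + hp, family = binomial, data = mtcars)
X  <- model.matrix(g1)
P  <- fitted(g1)
v  <- P * (1 - P)                  ## v_ii, with n_i = 1
Vh <- diag(sqrt(v))                ## sqrt(V)
H  <- Vh %*% X %*% solve(t(X) %*% (v * X)) %*% t(X) %*% Vh
all.equal(diag(H), unname(hatvalues(g1)))  ## TRUE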

\(\mathrm{Pr} \quad Pr_i\)

The Pearson residual, a measure of influence. This is: $$Pr_i = \frac{y_i - \mu_y}{\sigma_y}$$ where \(\mu_y\) and \(\sigma_y\) are the mean and standard deviation of a binomial distribution, and \(\sigma^2_y\) is its variance. $$E(y=1) = \mu_y = \hat{y} = nP \quad \mathrm{and} \quad \sigma_y = \sqrt{nP(1 - P)}$$ Thus: $$Pr_i = \frac{y_i - n_i P_i}{\sqrt{n_i P_i (1 - P_i)}}$$
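
A quick check of the final formula against stats::residuals for an individual-level model (\(n_i = 1\); the model is illustrative):

g1 <- glm(am ~ wt + hp, family = binomial, data = mtcars)
P  <- fitted(g1)
Pr <- (mtcars$am - P) / sqrt(P * (1 - P))        ## n_i = 1 per row
all.equal(Pr, residuals(g1, type = "pearson"))   ## TRUE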

\(\mathrm{dr} \quad dr_i\)

The deviance residual, a measure of influence: $$dr_i = \mathrm{sign}(y_i - \hat{y}_i) \sqrt{d_i}$$ \(d_i\) is the contribution of observation \(i\) to the model deviance. The \(\mathrm{sign}\) above is:

  • \(y_i > \hat{y}_i \quad \rightarrow \mathrm{sign}(i)=1\)

  • \(y_i = \hat{y}_i \quad \rightarrow \mathrm{sign}(i)=0\)

  • \(y_i < \hat{y}_i \quad \rightarrow \mathrm{sign}(i)=-1\)

In logistic regression this is: $$y_i = 1 \quad \rightarrow \quad dr_i = \sqrt{2 \left[ \log (1 + \exp(f(x))) - f(x) \right]}$$ $$y_i = 0 \quad \rightarrow \quad dr_i = -\sqrt{2 \log (1 + \exp(f(x)))}$$ where \(f(x)\) is the linear function of the predictors \(1 \ldots p\): $$f(x) = \hat{\beta}_0 + \hat{\beta}_1 x_{i1} + \ldots + \hat{\beta}_p x_{ip}$$ This is also: $$dr_i = \mathrm{sign}(y_i - \hat{y}_i) \sqrt{2 \left( y_i \log{\frac{y_i}{\hat{y}_i}} + (n_i - y_i) \log{\frac{n_i - y_i}{n_i(1 - P_i)}} \right)}$$ To avoid the problem of division by zero: $$y_i = 0 \quad \rightarrow \quad dr_i = -\sqrt{2 n_i \, |\log(1 - P_i)|}$$ Similarly, to avoid \(\log{\infty}\): $$y_i = n_i \quad \rightarrow \quad dr_i = \sqrt{2 n_i \, |\log(P_i)|}$$ The above equations are used when calculating \(dr_i\) by covariate group.
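
A check of the logistic-regression form against stats::residuals for an individual-level model (\(n_i = 1\); the model is illustrative):

g1 <- glm(am ~ wt + hp, family = binomial, data = mtcars)
f  <- predict(g1, type = "link")                 ## f(x)
dr <- ifelse(mtcars$am == 1,
              sqrt(2 * (log1p(exp(f)) - f)),     ## y_i = 1
             -sqrt(2 * log1p(exp(f))))           ## y_i = 0
all.equal(dr, unname(residuals(g1, type = "deviance")))  ## TRUE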

\(\mathrm{sPr} \quad sPr_i\)

The standardized Pearson residual. The residual is standardized by the leverage \(h_i\): $$sPr_i = \frac{Pr_i}{\sqrt{1 - h_i}}$$

\(\mathrm{sdr} \quad sdr_i\)

The standardized deviance residual. The residual is standardized by the leverage, as above: $$sdr_i = \frac{dr_i}{\sqrt{1 - h_i}}$$
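
Both standardized residuals can be reproduced with stats::rstandard; a sketch (the model is illustrative):

g1  <- glm(am ~ wt + hp, family = binomial, data = mtcars)
h   <- hatvalues(g1)
sPr <- residuals(g1, type = "pearson")  / sqrt(1 - h)
sdr <- residuals(g1, type = "deviance") / sqrt(1 - h)
all.equal(sPr, rstandard(g1, type = "pearson"))   ## TRUE
all.equal(sdr, rstandard(g1, type = "deviance"))  ## TRUE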

\(\mathrm{dChisq} \quad \Delta P\chi^2_i\)

The change in the Pearson chi-square statistic with observation \(i\) removed. Given by: $$\Delta P\chi^2_i = sPr_i^2 = \frac{Pr_i^2}{1 - h_i}$$ where \(sPr_i\) is the standardized Pearson residual, \(Pr_i\) is the Pearson residual and \(h_i\) is the leverage. \(\Delta P\chi^2_i\) should be \(<4\) if the observation has little influence on the model.

\(\mathrm{dDev} \quad \Delta D_i\)

The change in the deviance statistic \(D = \sum_{i=1}^n dr_i^2\) with observation \(i\) excluded. It is scaled by the leverage \(h_i\), as above: $$\Delta D_i = sdr_i^2 = \frac{dr_i^2}{1 - h_i}$$

\(\mathrm{dBhat} \quad \Delta \hat{\beta}_i\)

The change in \(\hat{\beta}\) with observation \(i\) excluded. This is scaled by the leverage as above: $$\Delta \hat{\beta}_i = \frac{sPr_i^2 h_i}{1 - h_i}$$ where \(sPr_i\) is the standardized Pearson residual. \(\Delta \hat{\beta}_i\) should be \(<1\) if the observation has little influence on the model coefficients.
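
The last three measures follow directly from \(sPr_i\), \(sdr_i\) and \(h_i\); a sketch applying the cut-offs noted above (the model is illustrative):

g1  <- glm(am ~ wt + hp, family = binomial, data = mtcars)
h   <- hatvalues(g1)
sPr <- rstandard(g1, type = "pearson")
sdr <- rstandard(g1, type = "deviance")
dChisq <- sPr^2                  ## change in Pearson chi-square
dDev   <- sdr^2                  ## change in deviance
dBhat  <- sPr^2 * h / (1 - h)    ## change in coefficient estimates
which(dChisq > 4 | dBhat > 1)    ## observations worth inspecting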

See Also

plot.glm

Examples

library(LogisticDx)
## H&L 2nd ed. Table 5.8. Page 182.
## Pattern nos. 31, 477, 468
data(uis)
uis <- within(uis, {
    NDRGFP1 <- 10 / (NDRGTX + 1)
    NDRGFP2 <- NDRGFP1 * log((NDRGTX + 1) / 10)
})
(d1 <- dx(g1 <- glm(DFREE ~ AGE + NDRGFP1 + NDRGFP2 + IVHX +
                    RACE + TREAT + SITE +
                    AGE:NDRGFP1 + RACE:SITE,
                    family=binomial, data=uis)))
d1[519:521, ]  ## the three most influential covariate patterns (nos. 31, 477, 468)
