ss: Sample size for a given coefficient and events per covariate for model

Description

Sample size for a given coefficient, and events per covariate, for a logistic regression model.

Usage

ss(x, ...)
# S3 method for glm
ss(
  x,
  ...,
  alpha = 0.05,
  beta = 0.8,
  coeff = names(stats::coef(x))[2],
  std = FALSE,
  alternative = c("one.sided", "two.sided"),
  OR = NULL,
  Px0 = NULL
)

Arguments

x

A regression model with class glm and x$family$family == "binomial".

...

Not used.

alpha

significance level $\alpha$ for the null-hypothesis significance test.

beta

power $\beta$ for the null-hypothesis significance test. Here beta is the power itself (default 0.8), not the type II error rate.

coeff

Name of coefficient (variable) in the model to be tested.

std

Standardize the coefficient?

If std=TRUE, a continuous coefficient will be standardized, using the mean $\bar{x}$ and standard deviation $\sigma_x$: $$z_x = \frac{x_i - \bar{x}}{\sigma_x}$$ Note that the usage above shows std = FALSE as the default.
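The standardization can be sketched in a few lines of base R. The vector x below is hypothetical, and the use of sd() (which divides by $n - 1$) is an assumption; this page does not say which estimator ss() uses internally.

```r
# Hypothetical continuous predictor (not taken from any dataset on this page)
x <- c(23, 31, 35, 40, 46)

# z-score: subtract the mean, divide by the sample standard deviation
z <- (x - mean(x)) / sd(x)

z
```

After this transformation the coefficient is interpreted per standard deviation of the predictor rather than per original unit.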

alternative

The default, alternative="one.sided", tests the null hypothesis using the standard normal quantile $z_{1-\alpha}$.

If alternative="two.sided", $z_{1-\alpha/2}$ is used instead.
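For orientation, the two critical values can be obtained with qnorm() from the stats package. This is only an illustration with the default alpha and beta from the usage section; the variable names are made up for the example.

```r
alpha <- 0.05   # default significance level from the usage section
beta  <- 0.8    # default power from the usage section

z_one_sided <- qnorm(1 - alpha)       # quantile for alternative = "one.sided"
z_two_sided <- qnorm(1 - alpha / 2)   # quantile for alternative = "two.sided"
z_power     <- qnorm(beta)            # quantile corresponding to the power

c(z_one_sided, z_two_sided, z_power)
```

The two-sided quantile is larger, so a two-sided test requires a larger sample for the same power.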

OR

Odds ratio. The size of the change in the probability.

Px0

The probability that $x=0$.

If not supplied, this is estimated from the data.

Value

A list of:

Sample size required to show coefficient for predictor is as given in the model rather than the alternative (by default $=0$).

epc

Events per covariate; should be $>10$ to make meaningful statements about the coefficients obtained.

Details

Gives the sample size necessary to demonstrate that a coefficient in the model for the given predictor is equal to its given value rather than equal to zero (or, if OR is supplied, the sample size needed to check for such a change in probability).

Also gives the number of events per predictor.

The number of events is the smaller of the counts of outcome $y=0$ and outcome $y=1$.
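A minimal sketch of the events-per-predictor idea, assuming the count of the less frequent outcome is divided by the number of predictors. The response vector and the predictor count below are hypothetical, and how ss() counts predictors (for example, dummy variables for factors) is not stated here.

```r
# Hypothetical binary response: 8 zeros and 4 ones
y <- c(0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0)

# Number of events: the count of the less frequent outcome
events <- min(table(y))

# Hypothetical number of predictors in the model (intercept excluded)
n_predictors <- 2

epc <- events / n_predictors
epc
```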

For a continuous coefficient, the calculation uses $\hat{\beta}$, the estimated coefficient from the model, $\delta$: $$\delta = \frac{1 + (1 + \hat{\beta}^2) \exp(1.25\hat{\beta}^2)}{1 + \exp(-0.25 \hat{\beta}^2)}$$ and $P_0$, the probability calculated from the intercept term $\beta_0$ of the logistic model glm(x$y ~ coeff, family=binomial) as $$P_0 = \frac{\exp(\beta_0)}{1 + \exp(\beta_0)}$$

For a model with one predictor, the calculation is: $$n = (1 + 2P_0 \delta) \frac{(z_{1-\alpha} + z_{\mathtt{beta}} \exp(0.25 \hat{\beta}^2))^2}{P_0 \hat{\beta}^2}$$

For a multivariable model, the value is adjusted by $R^2$, the correlation of coeff with the other predictors in the model: $$n_m = \frac{n}{1 - R^2}$$
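The two building blocks of the continuous case, $\delta$ and $P_0$, can be sketched directly from their definitions above. The function names delta_term and p0_from_intercept are hypothetical helpers written for illustration; they are not part of the package.

```r
# delta as a function of the estimated coefficient b (beta-hat)
delta_term <- function(b) {
  (1 + (1 + b^2) * exp(1.25 * b^2)) / (1 + exp(-0.25 * b^2))
}

# P0 from the intercept b0 of the logistic model (inverse logit)
p0_from_intercept <- function(b0) {
  exp(b0) / (1 + exp(b0))
}

delta_term(0.5)
p0_from_intercept(-1)
```

The inverse logit is also available in base R as plogis(), which is numerically safer for large intercepts.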

For a binomial coefficient, the calculation uses $P_0$, the probability given the null hypothesis, $P_a$, the probability given the alternative hypothesis, and the average probability $\bar{P} = \frac{P_0 + P_a}{2}$.

The calculation is: $$n = \frac{\left(z_{1-\alpha} \sqrt{2 \bar{P} (1 - \bar{P})} + z_{\mathtt{beta}} \sqrt{P_0(1 - P_0) + P_a(1 - P_a)}\right)^2}{(P_a - P_0)^2}$$

An alternative given by Whittemore uses $\hat{P} = P(x=0)$.
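The formula in $P_0$ and $P_a$ can be sketched as a small function. This is an illustration rather than the package's own code: n_two_prop is a hypothetical name, the one-sided quantile is assumed, and the denominator is the squared difference $P_a - P_0$, as in the standard two-proportion sample size formula.

```r
n_two_prop <- function(p0, pa, alpha = 0.05, power = 0.8) {
  z_alpha <- qnorm(1 - alpha)   # one-sided critical value
  z_power <- qnorm(power)       # quantile for the requested power
  p_bar   <- (p0 + pa) / 2      # average of the two probabilities
  num <- (z_alpha * sqrt(2 * p_bar * (1 - p_bar)) +
          z_power * sqrt(p0 * (1 - p0) + pa * (1 - pa)))^2
  num / (pa - p0)^2
}

# Hypothetical probabilities under the null and the alternative
n_two_prop(p0 = 0.2, pa = 0.3)
```

The required sample size grows quickly as the two probabilities approach each other, because the squared difference in the denominator shrinks.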

In the Whittemore alternative, the lead term in the equation below is used to correct for large values of $\hat{P}$: $$n = (1 + 2P_0) \frac{\left(z_{1-\alpha} \sqrt{\frac{1}{1-\hat{P}} + \frac{1}{\hat{P}}} + z_{\mathtt{beta}} \sqrt{\frac{1}{1-\hat{P}} + \frac{1}{\hat{P} \exp(\hat{\beta})}}\right)^2}{(P_0 \hat{\beta})^2}$$

As above, these can be adjusted in the multivariable case: $$n_m = \frac{n}{1 - R^2}$$

In this case, Pearson's $R^2$ correlation is between the fitted values from a logistic regression with coeff as the response and the other predictors as covariates.

The calculation uses $\bar{P}$, the mean probability (mean of the fitted values from the model): $$R^2 = \frac{\left(\sum_{i=1}^n (y_i - \bar{P})(P_i - \bar{P})\right)^2}{\sum_{i=1}^n (y_i - \bar{P})^2 \sum_{i=1}^n (P_i - \bar{P})^2}$$
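The $R^2$ adjustment can be sketched on simulated data. Everything below is hypothetical (the predictors z1 and z2, the binary predictor x, and the single-predictor sample size n_single); the sketch only shows that the formula above is the squared Pearson correlation between the binary predictor and its fitted probabilities, and how it inflates the sample size.

```r
set.seed(1)
n  <- 200
z1 <- rnorm(n)                                    # other predictors (simulated)
z2 <- rnorm(n)
x  <- rbinom(n, 1, plogis(0.5 * z1 - 0.3 * z2))   # binary predictor of interest

# Auxiliary logistic regression: predictor of interest as the response
aux   <- glm(x ~ z1 + z2, family = binomial)
p     <- fitted(aux)
p_bar <- mean(p)                                  # mean fitted probability

# R^2 written out term by term, as in the formula above
r2 <- sum((x - p_bar) * (p - p_bar))^2 /
  (sum((x - p_bar)^2) * sum((p - p_bar)^2))

n_single <- 300                                   # hypothetical unadjusted sample size
n_multi  <- n_single / (1 - r2)                   # multivariable adjustment
c(r2 = r2, n_multi = n_multi)
```

Because a logistic model with an intercept reproduces the mean of its response, $\bar{P}$ equals the mean of x here, so r2 agrees with cor(x, p)^2.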

References

Whittemore AS (1981). Sample Size for Logistic Regression with Small Response Probability. Journal of the American Statistical Association. 76(373):27-32. 10.2307/2287036 Also available at JSTOR at https://www.jstor.org/stable/2287036

Hsieh FY (1989). Sample size tables for logistic regression. Statistics in Medicine. 8(7):795-802. 10.1002/sim.4780080704 Also available at statpower (free).

Fleiss J (2003). Statistical methods for rates and proportions. 3rd ed. John Wiley, New York. 10.1002/0471445428 Also available at Google books (free preview).

Peduzzi P, Concato J, Kemper E, Holford T R, Feinstein A R (1996). A simulation study of the number of events per variable in logistic regression analysis. Journal of Clinical Epidemiology. 49(12):1373-79. 10.1016/S0895-4356(96)00236-3

Examples

# NOT RUN {
## H&L 2nd ed. Section 8.5.
## Results here are slightly different from the text due to rounding.
data(uis)
with(uis, prop.table(table(DFREE, TREAT), 2))
(g1 <- glm(DFREE ~ TREAT, data=uis, family=binomial))
ss(g1, coeff="TREATlong")
## Pages 340 - 341.
ss(g1, coeff="TREATlong", OR=1.5, Px0=0.5)
## standardize
uis <- within(uis, {
    AGES <- (AGE - 32) / 6
    NDRGTXS <- (NDRGTX - 5) / 5
})
## H&L 2nd ed. Section 8.5. Page 343.
## results slightly different due to rounding
g1 <- glm(DFREE ~ AGES, data=uis, family=binomial) 
ss(g1, coeff="AGES", std=FALSE, OR=1.5)
## H&L 2nd ed. Section 8.5. Table 8.37. Page 344.
summary(g1 <- glm(DFREE ~ AGES + NDRGTXS + IVHX + RACE + TREAT,
                  data=uis, family=binomial))
## H&L 2nd ed. Section 8.5. Page 345.
## results slightly different due to rounding
ss(g1, coeff="AGES", std=FALSE, OR=1.5)
ss(g1, coeff="TREATlong", std=FALSE, OR=1.5)
# }