Various approaches to calculating a goodness-of-fit measure for GLMs
analogous to R-squared in the ordinary linear model have been
proposed. Generally termed 'pseudo R-squared' measures, they include
variance-based, likelihood-based, and distribution-specific approaches.
Here, however, a more straightforward definition is used, which can be
applied to any model for which fitted values of the response variable are
generated: R-squared is calculated as the squared (weighted) correlation
between the observed and fitted values of the response (in the original
units). This is simply the squared version of the correlation measure
advocated by Zheng & Agresti (2000), itself an intuitive measure of
goodness-of-fit describing the predictive power of a model. As the measure
does not depend on any specific error distribution or model estimating
procedure, it is also generally comparable across many different types of
model (Kvalseth 1985). In the case of the ordinary linear model, the
measure equals the more traditional R-squared based on sums of squares.
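As a rough illustration of this definition (a sketch only, not the
package's internal code), the calculation for a fitted GLM reduces to a
weighted correlation between the observed and fitted response, squared.
The model and dataset below are arbitrary examples:

  # Hedged sketch: R-squared as the squared (weighted) correlation
  # between observed and fitted values of the response
  mod <- glm(count ~ spray, family = poisson, data = InsectSprays)
  y <- mod$y                   # observed response
  f <- fitted(mod)             # fitted values (response scale)
  w <- weights(mod, "prior")   # prior weights (all 1 here)
  r2 <- cov.wt(cbind(y, f), wt = w, cor = TRUE)$cor[1, 2]^2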
If adj = TRUE (default), the 'adjusted' R-squared value is also returned,
which provides an estimate of the population - as opposed to sample -
R-squared. This is achieved via an analytical formula which
adjusts R-squared for the 'degrees of freedom' of the model (i.e. the ratio
of observations to parameters). Here, this is calculated via the 'Pratt'
rather than standard 'Ezekiel/Wherry' formula, shown in a previous
simulation to be the most effective of a range of existing formulas at
estimating the population R-squared, across a range of model specification
scenarios (Yin & Fan 2001). Adjusted R-squared can be used to safeguard
against overfitting of the model to the original sample.
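For illustration, one common statement of the Pratt formula is sketched
below, where n is the number of observations and p the number of
predictors (excluding the intercept); the exact form used internally may
differ slightly:

  # Hedged sketch of the Pratt adjustment (Yin & Fan 2001), as one
  # common statement of the formula
  r2_pratt <- function(r2, n, p) {
    1 - ((n - 3) * (1 - r2) / (n - p - 1)) *
      (1 + 2 * (1 - r2) / (n - p - 2.3))
  }
  r2_pratt(r2 = 0.5, n = 50, p = 3)  # roughly 0.48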
If pred = TRUE (default), a 'predicted' R-squared is also returned,
which is calculated via the same formula as for R-squared but using
cross-validated rather than standard fitted values. These are obtained by
dividing model response residuals by the complement of the observation
leverages (diagonals of the hat matrix), then subtracting these inflated
'predicted' residuals from the response variable. This is essentially a
short cut to obtaining out-of-sample predictions, normally arising via a
leave-one-out cross-validation procedure (in a GLM they will not be exactly
equal to such predictions). The resulting R-squared is an estimate of the
R-squared that would be obtained were the model fitted to new data. It
will be lower than the original R-squared, and likely also the adjusted
R-squared, highlighting the loss of explanatory power when predicting new
data. This measure is a variant of an existing one, calculated by
substituting the 'PRESS' statistic, i.e. the sum of squares of the
predicted residuals (Allen 1974), for the residual sum of squares in the
classic R-squared formula.
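The shortcut described above can be sketched as follows for a simple
linear model (an illustrative example only, not the package's internal
code; for an lm these values do equal leave-one-out predictions):

  # Hedged sketch: cross-validated fitted values via the hat-matrix
  # shortcut, then the same squared-correlation measure as before
  mod <- lm(mpg ~ wt + hp, data = mtcars)
  y <- mtcars$mpg
  res <- y - fitted(mod)    # response residuals
  h <- hatvalues(mod)       # observation leverages
  pres <- res / (1 - h)     # inflated 'predicted' residuals
  fcv <- y - pres           # cross-validated fitted values
  r2_pred <- cor(y, fcv)^2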
For mixed models, the function will, by default, calculate all R-squared
metrics using fitted values incorporating both the fixed and random
effects, thus encompassing all variation captured by the model. This is
equivalent to the 'conditional' R-squared of Nakagawa et al. (2017).
To include only some or no random effects, simply set the appropriate
formula using the argument re.form, which is passed directly to
predict.merMod. If re.form = NA, R-squared is calculated for the fixed
effects only - equivalent to the 'marginal' R-squared of Nakagawa et al.
(2017).
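For example, with an lme4 model the two quantities can be sketched using
predict.merMod directly, assuming the squared-correlation definition
above (the function itself handles this via re.form):

  # Hedged illustration with lme4: conditional vs. marginal R-squared
  library(lme4)
  fm <- lmer(Reaction ~ Days + (Days | Subject), data = sleepstudy)
  y <- sleepstudy$Reaction
  r2_cond <- cor(y, predict(fm))^2                 # fixed + random effects
  r2_marg <- cor(y, predict(fm, re.form = NA))^2   # fixed effects only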
R-squared values produced by this function will always be bounded between
zero (no fit) and one (perfect fit), meaning that any negative values
arising from calculations will be rounded up to zero. Negative values
typically mean that the fit is 'worse' than the null expectation of no
relationship between the variables, which can be difficult to interpret in
practice and in any case arises only in rare situations, such as
where the intercept is suppressed. Hence, for simplicity and ease of
interpretation, values less than zero are presented here as a complete lack
of model fit.