Various approaches to the calculation of a goodness of fit measure
for GLMs analogous to R-squared in the ordinary linear model have been
proposed. Generally termed 'pseudo R-squared' measures, they include
variance-based, likelihood-based, and distribution-specific approaches.
Here, however, a more straightforward definition is used, which can be
applied to any model for which fitted values of the response variable are
generated: R-squared is calculated as the squared (weighted) correlation
between the observed and fitted values of the response (in the original
units). This is simply the squared version of the correlation measure
advocated by Zheng & Agresti (2000), itself an intuitive measure of
goodness of fit describing the predictive power of a model. As the measure
does not depend on any specific error distribution or model estimating
procedure, it is also generally comparable across many different types of
model (Kvalseth, 1985). In the case of the ordinary linear model, the
measure is exactly equal to the traditional R-squared based on sums of
squares.
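As a minimal sketch of this definition (using base R, an unweighted correlation, and an illustrative Poisson GLM):

  # R-squared as the squared correlation between observed and fitted
  # values of the response (response scale, unweighted)
  m <- glm(count ~ spray, family = poisson, data = InsectSprays)
  y <- InsectSprays$count
  r2 <- cor(y, fitted(m))^2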
If adj = TRUE (default), the 'adjusted' R-squared value is also returned, which provides an estimate of the population (as opposed to sample)
R-squared. This is achieved via an analytical formula which adjusts
R-squared using the 'degrees of freedom' of the model (i.e. the ratio of
observations to parameters), helping to counter multiple R-squared's
positive bias and guard against overfitting of the model to noise in the
original sample. By default, this is calculated via the exact 'Olkin-Pratt'
estimator, shown in recent simulations to be the optimal unbiased
population R-squared estimator across a range of estimators and
specification scenarios (Karch, 2020), and thus a good general first
choice, even for smaller sample sizes. Setting adj.type = "ezekiel", however, will use the simpler and more common 'Ezekiel' formula, which can
be more appropriate where minimising the mean squared error (MSE) of the
estimate is more important than strict unbiasedness (Hittner, 2019; Karch,
2020).
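For illustration, the Ezekiel formula can be sketched as below, with n observations and p estimated parameters (including the intercept); the exact Olkin-Pratt estimator is more involved, additionally requiring the Gaussian hypergeometric function (Karch, 2020):

  # Ezekiel's adjusted R-squared -- an illustrative sketch, not
  # necessarily the exact implementation used here
  adj_r2_ezekiel <- function(r2, n, p) {
    1 - (1 - r2) * (n - 1) / (n - p)
  }
  adj_r2_ezekiel(r2, n = nrow(InsectSprays), p = length(coef(m)))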
If pred = TRUE (default), a 'predicted' R-squared is also returned, which
is calculated via the same formula as for R-squared but using
cross-validated, rather than original, fitted values. These are obtained by
dividing the model residuals (on the response scale) by the complement of
the observation leverages (diagonals of the hat matrix), then subtracting
these inflated 'predicted' residuals from the response variable. This is
essentially a shortcut to obtaining 'out-of-sample' predictions, normally
arising via a 'leave-one-out' cross-validation procedure (they are not
exactly equal for GLMs). The resulting R-squared is an estimate of the
R-squared that would result were the model fit to new data, and will be
lower than the original (and likely also the adjusted) R-squared,
highlighting the loss of explanatory power due to sample noise. Predicted
R-squared may be a more powerful and general indicator of overfitting than adjusted R-squared,
as it provides a true out-of-sample test. This measure is a variant of an existing one, calculated by substituting the 'PRESS' statistic, i.e. the sum of squares of the predicted residuals (Allen, 1974), for the residual sum of squares in the classic R-squared formula. It is not calculated here for GLMMs, as the interpretation of the hat matrix is not reliable (see lme4:::hatvalues.merMod()).
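Continuing the GLM sketch above, the leverage-based shortcut might look like this (for GLMs, an approximation to exact leave-one-out refitting):

  h <- hatvalues(m)                                  # observation leverages
  pres <- residuals(m, type = "response") / (1 - h)  # 'predicted' residuals
  r2_pred <- cor(y, y - pres)^2                      # cross-validated fit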
For models fitted with one or more offsets, these will be removed by
default from the response variable and fitted values prior to calculations.
Thus R-squared will measure goodness of fit only for the predictors with
estimated, rather than fixed, coefficients. This is likely to be the
intended behaviour in most circumstances, though if users wish to include
variation due to the offset(s), they can set offset = TRUE.
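As a sketch of this behaviour for an identity-link model, where the offset can simply be subtracted (for other links, removal would occur on the appropriate scale):

  # Simulated data with a fixed-coefficient offset term 'off'
  set.seed(1)
  x <- rnorm(50); off <- rnorm(50)
  y2 <- 1 + 2 * x + off + rnorm(50)
  m2 <- lm(y2 ~ x + offset(off))
  r2_no_off <- cor(y2 - off, fitted(m2) - off)^2  # estimated terms only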
For mixed models, the function will, by default, calculate all R-squared
measures using fitted values incorporating both the fixed and random
effects, thus encompassing all variation captured by the model. This is
equivalent to the 'conditional' R-squared of Nakagawa et al. (2017) (though
see that reference for a more advanced approach to R-squared for mixed
models). To include only some or no random effects, simply set the appropriate formula using the argument re.form, which is passed directly to lme4:::predict.merMod(). If re.form = NA, R-squared is calculated for the fixed effects only, i.e. the 'marginal' R-squared of Nakagawa et al. (2017).
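A minimal sketch of both measures, using lme4's built-in sleepstudy data:

  library(lme4)
  fm <- lmer(Reaction ~ Days + (Days | Subject), sleepstudy)
  obs <- sleepstudy$Reaction
  r2_cond <- cor(obs, fitted(fm))^2                 # fixed + random effects
  r2_marg <- cor(obs, predict(fm, re.form = NA))^2  # fixed effects only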
As R-squared is calculated here as a squared correlation, the type of correlation coefficient can also be specified. Setting this to "spearman"
may be desirable in some cases, such as where the relationship between
response and fitted values is not necessarily bivariate normal or linear,
and a correlation of the ranks may be more informative and/or general. This
purely monotonic R-squared can also be considered a useful goodness of fit measure,
and may be more appropriate for comparisons between GLMs and ordinary
linear models in some scenarios.
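Continuing the mixed model sketch above, the rank-based variant is simply:

  r2_rank <- cor(obs, fitted(fm), method = "spearman")^2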
R-squared values produced by this function will by default be in the range
0-1, meaning that any negative values arising from calculations will be
converted to zero. Negative values essentially mean that the fit is 'worse' than the null expectation of no relationship between the variables. This can be difficult to interpret in practice, and in any case arises only in rare situations, such as where the intercept is suppressed or where a low value of R-squared is adjusted downwards via an analytic estimator. Such values are also 'impossible' in practice, given that R-squared is generally understood to be a non-negative measure. Hence, for
simplicity and ease of interpretation, values less than zero are presented
as a complete lack of model fit. This is also recommended by Shieh (2008),
who shows for adjusted R-squared that such 'positive-part' estimators have
lower MSE in estimating the population R-squared (though higher bias). To allow negative values to be returned, however, set positive.only = FALSE. This
may be desirable for simulation purposes, and/or where strict unbiasedness
is prioritised.
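In sketch form, the default behaviour amounts to a simple positive-part truncation of each estimate:

  # Truncate negative estimates at zero (skipped if positive.only = FALSE)
  r2 <- max(0, r2)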