calc_rsq: Calculate r-squared for observed vs. predicted values

Description

Calculate the square of the Pearson correlation coefficient (r) between observed and model-predicted values.

Usage

calc_rsq(pred, obs, obs_sd, n_subj, detect, log10_trans = FALSE)

Value

A numeric scalar: the R-squared value for observations vs. predictions.

Arguments

pred: Numeric vector: Model-predicted value corresponding to each observed value. Even if `log10_trans` is `TRUE`, these should *not* be log10-transformed. (If `log10_trans` is `TRUE`, they will be log10-transformed internally to this function before calculation.)
obs: Numeric vector: Observed sample means for summary data, or observed values for non-summary data. Censored observations should *not* be NA; they should be substituted with the LOQ. Even if `log10_trans` is `TRUE`, these should *not* be log10-transformed. (If `log10_trans` is `TRUE`, they will be transformed to log10-scale means internally to this function before calculation.)
obs_sd: Numeric vector: Observed sample SDs for summary data. For non-summary data (individual-subject observations), the corresponding element of `obs_sd` should be set to 0. Even if `log10_trans` is `TRUE`, these should *not* be log10-transformed. (If `log10_trans` is `TRUE`, they will be transformed to log10-scale standard deviations internally to this function before calculation.)
n_subj: Numeric vector: Number of subjects in each observed sample for summary data. For non-summary data (individual-subject observations), the corresponding element of `n_subj` should be set to 1.
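detect: Logical vector: Whether each observation was detected (`TRUE`) or was censored, i.e. a non-detect below the LOQ (`FALSE`).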
log10_trans: Logical. FALSE (default) means that R-squared is computed for observations vs. predictions. TRUE means that R-squared is computed for log10(observations) vs. log10(predictions) (see Details).

Author

Caroline Ring

Details

Calculate the square of the Pearson correlation coefficient (r) between observed and model-predicted values, when observed data may be left-censored (non-detect) or may be reported in summary form (as sample mean, sample standard deviation, and sample number of subjects). Additionally, handle the situation when observed data and predictions need to be log10-transformed before r-squared is calculated.

$r^2$ is calculated according to the following formula, to properly handle observations reported in summary format:

$$ r^2 = \left( \frac{ \sum_{i=1}^G n_i \mu_i \bar{y}_i - \bar{y} \sum_{i=1}^G n_i \mu_i - \bar{\mu} \sum_{i=1}^G n_i \bar{y}_i + \bar{\mu} \bar{y} \sum_{i=1}^G n_i } { \sqrt{ \sum_{i=1}^G (n_i - 1) s_i^2 + \sum_{i=1}^G n_i \bar{y}_i^2 - 2 \bar{y} \sum_{i=1}^G n_i \bar{y}_i + N \bar{y}^2 } \sqrt{ \sum_{i=1}^G n_i \mu_i^2 - 2 \bar{\mu} \sum_{i=1}^G n_i \mu_i + N \bar{\mu}^2 } } \right)^2 $$

In this formula, there are $G$ groups (reported observations). (For CvTdb data, a "group" is a specific combination of chemical, species, route, medium, dose, and timepoint.) $n_i$ is the number of subjects for group $i$. $\bar{y}_i$ is the sample mean for group $i$. $s_i$ is the sample standard deviation for group $i$. $\mu_i$ is the model-predicted value for group $i$. $\bar{y}$ is the grand mean of observations:

$$ \bar{y} = \frac{ \sum_{i=1}^G n_i \bar{y}_i }{\sum_{i=1}^G n_i} $$

$\bar{\mu}$ is the grand mean of predictions:

$$ \bar{\mu} = \frac{ \sum_{i=1}^G n_i \mu_i }{\sum_{i=1}^G n_i} $$

$N$ is the grand total of subjects:

$$N = \sum_{i=1}^G n_i$$

For the non-summary case ($N$ single-subject observations, with all $n_i = 1$, $s_i = 0$, and $\bar{y}_i = y_i$), this formula reduces to the familiar formula

$$ r^2 = \left( \frac{\sum_{i=1}^N (y_i - \bar{y}) (\mu_i - \bar{\mu})} {\sqrt{ \sum_{i=1}^N (y_i - \bar{y})^2 } \sqrt{ \sum_{i=1}^N (\mu_i - \bar{\mu})^2 } } \right)^2 $$
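As an illustration only, the following is a minimal R sketch of the summary-data calculation, written in centered form (weighted sums of squares and cross-products about the grand means). The helper name `rsq_summary` and the example data are hypothetical and are not part of the package; the sketch ignores censoring and log10 transformation. It checks numerically that individual-subject data, and the same data collapsed into group summaries, both give the same value as `cor()^2`.

```r
# Hypothetical helper (not the package's calc_rsq): r-squared from summary data.
rsq_summary <- function(pred, obs_mean, obs_sd, n_subj) {
  N <- sum(n_subj)                          # grand total of subjects
  y_bar <- sum(n_subj * obs_mean) / N       # grand mean of observations
  mu_bar <- sum(n_subj * pred) / N          # grand mean of predictions
  # Cross-product of observations and predictions about their grand means
  sxy <- sum(n_subj * (obs_mean - y_bar) * (pred - mu_bar))
  # Total sum of squares of observations: within-group plus between-group
  syy <- sum((n_subj - 1) * obs_sd^2) + sum(n_subj * (obs_mean - y_bar)^2)
  # Sum of squares of predictions
  sxx <- sum(n_subj * (pred - mu_bar)^2)
  (sxy / (sqrt(syy) * sqrt(sxx)))^2
}

# Made-up individual-subject observations in three groups,
# with one model prediction per group
y <- c(1.2, 0.9, 2.5, 2.1, 2.9, 4.0)
grp <- c(1, 1, 2, 2, 2, 3)
mu_grp <- c(1.0, 2.4, 3.5)

# Non-summary case: every n_i = 1 and every s_i = 0
r2_individual <- rsq_summary(pred = mu_grp[grp], obs_mean = y,
                             obs_sd = rep(0, length(y)),
                             n_subj = rep(1, length(y)))
r2_cor <- cor(y, mu_grp[grp])^2

# Summary case: collapse the same data to group mean, SD, and N
n_i <- as.vector(table(grp))
ybar_i <- as.vector(tapply(y, grp, mean))
s_i <- as.vector(tapply(y, grp, sd))
s_i[is.na(s_i)] <- 0  # sd() of a single observation is NA; treat as 0
r2_summary <- rsq_summary(pred = mu_grp, obs_mean = ybar_i,
                          obs_sd = s_i, n_subj = n_i)

all.equal(r2_individual, r2_cor)
all.equal(r2_summary, r2_cor)
```

The agreement holds because each group's $(n_i - 1) s_i^2$ term recovers the within-group sum of squares that is lost when only the group mean is reported.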

# Left-censored data

If the observed value is censored, and the predicted value is less than the reported LOQ, then the observed value is (temporarily) set equal to the predicted value, for an effective error of zero.

If the observed value is censored, and the predicted value is greater than the reported LOQ, then the observed value is (temporarily) set equal to the reported LOQ, for an effective error of (LOQ - predicted).
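The two substitution rules above amount to replacing each censored observation with the smaller of the LOQ and the prediction. A minimal sketch follows; the helper name `substitute_censored` is hypothetical, and the sketch assumes that `detect` is `TRUE` for detected observations and that `obs` already holds the LOQ for censored ones.

```r
# Hypothetical helper (not part of the package): temporary substitution
# of left-censored observations.
# Assumption: detect == TRUE means detected; for censored observations,
# obs contains the reported LOQ.
substitute_censored <- function(obs, pred, detect) {
  # Detected: keep the observation unchanged.
  # Censored: use the prediction if it is below the LOQ, otherwise the LOQ.
  ifelse(detect, obs, pmin(obs, pred))
}

obs <- c(5, 0.1, 0.1)               # second and third entries are LOQs
pred <- c(4, 0.05, 0.3)
detect <- c(TRUE, FALSE, FALSE)

adj <- substitute_censored(obs, pred, detect)
adj  # 5.00 0.05 0.10
```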

# Log-10 transformation

If `log10_trans` is `TRUE`, then both the observed and predicted values will be log10-transformed before calculating r-squared. In the case where observed values are reported in summary format, the sample mean and sample SD of each group (reported on the natural scale, i.e. the mean and SD of natural-scale individual observations) are used to produce estimates of the log10-scale sample mean and sample SD (i.e., the mean and SD of log10-transformed individual observations), using [convert_summary_to_log10()].

The formulas are as follows. Again, $\bar{y}_i$ is the sample mean for group $i$. $s_i$ is the sample standard deviation for group $i$.

$$\textrm{log10-scale sample mean}_i = \log_{10} \left(\frac{\bar{y}_i^2}{\sqrt{\bar{y}_i^2 + s_i^2}} \right)$$

$$\textrm{log10-scale sample SD}_i = \sqrt{\log_{10} \left(1 + \frac{s_i^2}{\bar{y}_i^2} \right)}$$
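For illustration, the two formulas above can be transcribed directly into R. This is a stand-alone sketch under the assumption that the formulas are applied exactly as written; the helper name `summary_to_log10` is hypothetical and is not the source of `convert_summary_to_log10()`.

```r
# Hypothetical stand-alone transcription of the two formulas above.
summary_to_log10 <- function(sample_mean, sample_sd) {
  list(
    log10_mean = log10(sample_mean^2 / sqrt(sample_mean^2 + sample_sd^2)),
    log10_sd = sqrt(log10(1 + sample_sd^2 / sample_mean^2))
  )
}

# First element: an individual-subject observation (SD = 0).
# Second element: a summary observation with nonzero SD.
res <- summary_to_log10(sample_mean = c(10, 2.5), sample_sd = c(0, 1))
res$log10_mean
res$log10_sd
```

With an SD of 0, the converted mean is simply `log10()` of the observation and the converted SD is 0, so individual-subject observations pass through the conversion as an ordinary log10 transformation. With a nonzero SD, the log10-scale mean is smaller than `log10()` of the natural-scale mean.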