The linear regression (glm) of this data set yields the model M.
Thus the conventional squared correlation coefficient, $r^2$, can be calculated:
$$r^2 = 1-\frac{\sum\limits_{i=1}^N\left( y_i^{fit} - y_i\right)^2}{\sum\limits_{i=1}^N\left( y_i - y_{mean}\right)^2} \equiv 1 - \frac{RSS}{SS}$$
The numerator corresponds to the Residual Sum of Squares, RSS, the sum of squared differences between the fitted values $y_i^{fit}$ predicted by M and the observations $y_i$.
The denominator is the Sum of Squares, SS, the sum of squared differences between the observations $y_i$ and their mean $y_{mean}$.
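As a minimal sketch of this calculation, assuming a hypothetical toy data frame d with response y and predictors x1 and x2 (names chosen only to mirror the sample data sets below, not taken from the package):

set.seed(1)                                    # reproducible toy data (illustration only)
d <- data.frame(x1 = rnorm(20), x2 = rnorm(20))
d$y <- 2 * d$x1 - d$x2 + rnorm(20, sd = 0.3)

M   <- glm(y ~ x1 + x2, data = d)              # linear regression yields the model M
RSS <- sum((fitted(M) - d$y)^2)                # residual sum of squares
SS  <- sum((d$y - mean(d$y))^2)                # sum of squares around the mean
r2  <- 1 - RSS / SS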
To compare the calibration of M with its predictive power, M is applied to an external data set.
It is called external because these data have not been used in the linear regression that generated M.
Comparing the predictions $y_i^{pred}$ with the observations $y_i$ yields the predictive squared correlation coefficient, $q^2$:
$$q^2 = 1-\frac{\sum\limits_{i=1}^N\left( y_i^{pred} - y_i\right)^2}{\sum\limits_{i=1}^N\left( y_i - y_{mean}\right)^2} \equiv 1 - \frac{PRESS}{SS}$$
The PREdictive residual Sum of Squares, PRESS, is the sum of squared differences between the predictions $y_i^{pred}$ and the observations $y_i$.
The Sum of Squares, SS, refers to the sum of squared differences between the observations $y_i$ and their mean $y_{mean}$.
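Under the same assumptions as the sketch above, with a hypothetical external data frame d.ext that was not used to fit M, $q^2$ could be computed as:

d.ext <- data.frame(x1 = rnorm(10), x2 = rnorm(10))
d.ext$y <- 2 * d.ext$x1 - d.ext$x2 + rnorm(10, sd = 0.3)

y.pred <- predict(M, newdata = d.ext)          # predictions of M for the external data
PRESS  <- sum((y.pred - d.ext$y)^2)            # predictive residual sum of squares
SS     <- sum((d.ext$y - mean(d.ext$y))^2)     # sum of squares around the mean
q2     <- 1 - PRESS / SS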
If no external data set is available, one can perform a cross-validation to evaluate the prediction performance.
The cross-validation splits the model data set ($N$ elements) into a training set ($N-k$ elements) and a test set ($k$ elements).
Each training set yields an individual model M', which is used to predict the missing $k$ value(s).
Each model M' differs slightly from M.
Thereby each observed value $y_i$ is predicted once, and the comparison between the observation and the prediction ($y_i^{pred(N-k)}$) yields $q^2_{cv}$:
$$q^2_{cv} = 1-\frac{\sum\limits_{i=1}^N\left( y_i^{pred(N-k)} - y_i\right)^2}{\sum\limits_{i=1}^N\left( y_i - y_{mean}^{N-k,i}\right)^2}$$
The arithmetic mean used in this equation, $y_{mean}^{N-k,i}$, is calculated individually for each test set from the observed values comprised in the corresponding training set.
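Continuing the hypothetical data frame d from the sketches above, a leave-one-out cross-validation ($k = 1$) following this equation could look like:

N     <- nrow(d)
y.cv  <- numeric(N)                            # cross-validated prediction per element
y.tmn <- numeric(N)                            # training-set mean per left-out element
for (i in seq_len(N)) {
  Mi       <- glm(y ~ x1 + x2, data = d[-i, ]) # model M' built without element i
  y.cv[i]  <- predict(Mi, newdata = d[i, ])
  y.tmn[i] <- mean(d$y[-i])                    # mean of the observed training values
}
q2.cv <- 1 - sum((y.cv - d$y)^2) / sum((d$y - y.tmn)^2)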
If $k > 1$, the composition of the training and test sets may affect the calculation of the predictive squared correlation coefficient.
To reduce this bias, one can repeat the calculation with different compositions of the training and test sets.
Thus, each observed value is predicted several times, according to the number of runs performed.

Note that if the prediction performance is evaluated with cross-validation, the predictive squared correlation coefficient, $q^2$, is a more accurate measure than the conventional squared correlation coefficient, $r^2$.

In addition to $r^2$ and $q^2$, the root mean square error, rmse, is calculated to measure the accuracy of the model M:
$$rmse = \sqrt{\frac{\sum\limits_{i=1}^N\left( y_i^{pred} - y_i\right)^2}{N-\nu}}$$
The rmse measures the deviation between a model's predictions ($y_i^{pred}$) and the actual observations ($y_i$) and can be applied to both calibration and prediction power.
It depends on the number of observations N and the method used to generate the model M.
Without correction, the rmse tends to overestimate the performance of M.
Following Friedrich Bessel's suggestion [Upton and Cook 2008], this overestimation can be corrected by taking the degrees of freedom, $\nu$, into account.
Thus, in the case of cross-validation, $\nu = 1$ is recommended when calculating the rmse with respect to the prediction power.
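A minimal sketch of this correction, reusing the hypothetical objects from the cross-validation sketch above (the function name rmse is chosen for illustration and is not part of the package):

rmse <- function(y.pred, y.obs, nu = 0) {      # nu: degrees-of-freedom correction
  sqrt(sum((y.pred - y.obs)^2) / (length(y.obs) - nu))
}
rmse(y.cv, d$y, nu = 1)                        # prediction rmse (cross-validation)
rmse(fitted(M), d$y)                           # calibration rmse, nu = 0 fixed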
The degrees of freedom, $\nu$, for the calculation of the rmse regarding the prediction power can be set as a parameter for cvq2(), looq2() and q2().
In contrast, $\nu = 0$ is fixed when calculating the rmse with respect to the model calibration.

In case the input is a comparison of observed and predicted values only (II), $r^2$ and $q^2$, as well as their rmse values, are calculated immediately for these data. Neither a model M is generated nor a cross-validation applied.

library(cvq2)
data(cvq2.sample.A)
# cross-validation with the default settings
result <- cvq2( cvq2.sample.A, y ~ x1 + x2 )
result

data(cvq2.sample.B)
# 3-fold cross-validation
result <- cvq2( cvq2.sample.B, y ~ x, nFold = 3 )
result

data(cvq2.sample.B)
# 3-fold cross-validation, repeated in 5 runs with different set compositions
result <- cvq2( cvq2.sample.B, y ~ x, nFold = 3, nRun = 5 )
result

data(cvq2.sample.A)
data(cvq2.sample.A_pred)
# model built from cvq2.sample.A, prediction power measured with the
# external data set cvq2.sample.A_pred
result <- q2( cvq2.sample.A, cvq2.sample.A_pred, y ~ x1 + x2 )
result

data(cvq2.sample.C)
# calibration power calculated directly from observed and predicted values (input II)
result <- calibPow( cvq2.sample.C )
result

data(cvq2.sample.D)
# prediction power calculated directly from observed and predicted values (input II)
result <- predPow( cvq2.sample.D, obs_mean = "observed_mean" )
result