y that is related to the true, unobserved trait yTRUE as follows: yTRUE = y + noise, where the noise is assumed to have mean zero and a constant variance. Assume you have one or more surrogate markers for yTRUE corresponding to the columns of datX. The function implements several approaches for estimating yTRUE based on the inputs y and/or datX.

TrueTrait(datX, y, datXtest=NULL,
          corFnc = "bicor", corOptions = "use = 'pairwise.complete.obs'",
          LeaveOneOut.CV=FALSE, skipMissingVariables=TRUE,
          addLinearModel=FALSE)
datX
equals the number of observations, i.e. it should equal the length of y
datX
, i.e. the two data sets should have the same number of columns but the number of rows (test set observations) can be different.
"cor"
or biweight mid-correlation "bicor"
. Additional arguments to the correlation
functicorFnc
.y.true1
and y.true2
based on datX
.lm(y~., data=datX)
y
.
The first column y.true1 is the average value of the standardized columns of datX, where standardization subtracts out the intercept term and divides by the slope of the linear regression model lm(marker~y). Since this estimate ignores the fact that the surrogate markers have different correlations with y, it is typically inferior to y.true2.
The second column y.true2 equals the weighted average value of the standardized columns of datX. The standardization is described in section 2.4 of Klemera et al. The weights are proportional to r^2/(1+r^2), where r denotes the correlation between the surrogate marker and y. Since this estimate does not include y as an additional surrogate marker, it may be slightly inferior to y.true3. Having said this, the difference between y.true2 and y.true3 is often negligible.
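The two descriptions above can be illustrated with a minimal base R sketch. It is not the WGCNA implementation: the simulated data, the names est1 and est2, and the use of the Pearson correlation are assumptions made for illustration, and the same lm(marker~y) standardization is assumed for both estimates.

```r
set.seed(1)
n <- 200
y <- rnorm(n, mean = 50, sd = 20)   # observed trait
yTRUE <- y + rnorm(n, sd = 10)      # unobserved true trait (known only because we simulate)
# five hypothetical surrogate markers, each linear in yTRUE plus noise
datX <- sapply(1:5, function(j) 10 * j + 0.5 * j * yTRUE + rnorm(n, sd = 10 * j))
colnames(datX) <- paste0("marker", 1:5)

# standardize each marker with the intercept and slope of lm(marker ~ y)
standardized <- apply(datX, 2, function(marker) {
  fit <- coef(lm(marker ~ y))
  (marker - fit[1]) / fit[2]
})

# y.true1-style estimate: plain average of the standardized markers
est1 <- rowMeans(standardized)

# y.true2-style estimate: weights proportional to r^2 / (1 + r^2), as stated in the text
r <- as.vector(cor(datX, y))
w <- r^2 / (1 + r^2)
w <- w / sum(w)                     # normalize so the weights sum to one
est2 <- as.vector(standardized %*% w)
```

Because yTRUE is simulated here, mean(abs(yTRUE - est1)) and mean(abs(yTRUE - est2)) can be compared directly to see how much the weighting helps on a given data set.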
An additional column called y.lm is added if addLinearModel=TRUE. In this case, y.lm reports the linear model predictions.
Finally, the column y.true3 is very similar to y.true2 but it includes y as an additional surrogate marker. It is expected to be the best estimate of the underlying true trait (see Klemera et al. 2006).
datXtest
. In this case, it contains a data frame with columns ytrue1 and ytrue2. The number of rows equals the number of test set observations, i.e. the number of rows of datXtest. Since the value of y is not known in case of a test data set, one cannot calculate y.true3. An additional column with linear model predictions y.lm is added if addLinearModel=TRUE.
LeaveOneOut.CV has been set to TRUE.
In this case, it contains a data frame with leave-one-out cross validation estimates of ytrue1 and ytrue2. The number of rows equals the length of y. Since the value of y is not known in case of a test data set, one cannot calculate y.true3.
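The test-set and leave-one-out outputs share one idea: the center, scale and weight of each marker are learned from the rows where y is available and then applied to held-out rows. The base R sketch below illustrates that idea only. The helpers calibrate and applyCalibration and the columns est1 and est2 are hypothetical names, and it is an assumption that the calibration is re-estimated without observation i in the leave-one-out case.

```r
set.seed(3)
n <- 60
y <- rnorm(n, mean = 50, sd = 20)
yTRUE <- y + rnorm(n, sd = 10)
datX <- sapply(1:4, function(j) 5 * j + j * yTRUE + rnorm(n, sd = 15 * j))

# hypothetical helper: learn center, scale and weight per marker from training data
calibrate <- function(datX, y) {
  coefs <- apply(datX, 2, function(marker) coef(lm(marker ~ y)))
  r <- as.vector(cor(datX, y))
  w <- r^2 / (1 + r^2)
  data.frame(center = coefs[1, ], scale = coefs[2, ], weight = w / sum(w))
}

# hypothetical helper: apply a stored calibration to new rows; y is not needed here
applyCalibration <- function(info, newX) {
  standardized <- sweep(sweep(newX, 2, info$center, "-"), 2, info$scale, "/")
  data.frame(est1 = rowMeans(standardized),
             est2 = as.vector(standardized %*% info$weight))
}

# test-set style use: calibrate on the first 40 rows, predict the remaining 20
info <- calibrate(datX[1:40, ], y[1:40])
testEstimates <- applyCalibration(info, datX[41:n, , drop = FALSE])

# leave-one-out style use: recalibrate without row i, then predict row i
looEstimates <- do.call(rbind, lapply(seq_len(n), function(i) {
  applyCalibration(calibrate(datX[-i, ], y[-i]), datX[i, , drop = FALSE])
}))
```

Neither helper needs y for the held-out rows, which is why an estimate that uses y as an extra marker cannot be formed for them.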
y.true2 and the true (unobserved) yTRUE. It corresponds to formula 33.
y.true3 and the true (unobserved) yTRUE. It corresponds to formula 42.
datX) when it comes to the definition of y.true2. The rows correspond to the variables. Columns report the variable name, the center (intercept that is subtracted to scale each variable), the scale (i.e. the slope that is used in the denominator), and finally the weights used in the weighted sum of the scaled variables.
Strata is different from NULL. In this case, it has the same dimensions as datEstimates but the estimates were calculated separately for each level of Strata.
Strata. Each component reports the estimate of SD.ytrue2 for observations in the stratum specified by unique(Strata).
y and a list of surrogate markers corresponding to the columns of datX.
2) There is a linear relationship between the true underlying trait and y and the surrogate markers.
3) yTRUE = y + Noise, where the Noise term has a mean of zero and a fixed variance.
4) Weighted least squares estimation is used to relate the surrogate markers to the underlying trait, where the weights are proportional to 1/ssq.j and ssq.j is the noise variance of the j-th marker.
Specifically, output y.true1 corresponds to formula 31, y.true2 corresponds to formula 25, and y.true3 corresponds to formula 34.
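Assumptions 2) to 4) can be written out as a short derivation. The symbols x_j, a_j, b_j, t and the estimate \hat{t} are illustrative notation introduced here, not taken from the source, and the sketch does not claim to reproduce any of the numbered formulas of the paper.

```latex
% Assumed notation: x_j is the j-th marker, t stands for yTRUE,
% a_j and b_j are the intercept and slope of marker j, ssq_j is its noise variance.
\[
  x_j = a_j + b_j\, t + \varepsilon_j, \qquad
  \mathrm{E}(\varepsilon_j) = 0, \qquad
  \operatorname{Var}(\varepsilon_j) = \mathrm{ssq}_j
\]
% Weighted least squares with weights 1/ssq_j; setting the derivative in t to zero gives
\[
  \hat{t}
  = \arg\min_t \sum_j \frac{(x_j - a_j - b_j t)^2}{\mathrm{ssq}_j}
  = \frac{\sum_j \dfrac{b_j^2}{\mathrm{ssq}_j}\cdot\dfrac{x_j - a_j}{b_j}}
         {\sum_j \dfrac{b_j^2}{\mathrm{ssq}_j}}
\]
```

Read this way, the estimate is a weighted average of the standardized markers (x_j - a_j)/b_j, so markers with a steep slope and little noise contribute most.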
Although the true underlying trait yTRUE is not known, one can estimate the standard deviation between the estimate y.true2 and yTRUE using formula 33. Similarly, one can estimate the SD for the estimate y.true3 using formula 42. These estimated SDs correspond to output components 2 and 3, respectively. These SDs are valuable since they provide a sense of how accurate the measure is.
To estimate the correlations between y and the surrogate markers, one can specify different correlation measures. The usage above sets corFnc = "bicor", the biweight midcorrelation (see help(bicor) to learn more); the Pearson correlation can be chosen by specifying "cor".
When datX comprises observations measured in different strata (e.g. different batches or independent data sets), one can obtain stratum-specific estimates by specifying the strata using the argument Strata. In this case, the estimation focuses on one stratum at a time.
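Estimating one stratum at a time can be sketched in base R as follows. This is an illustration under assumptions, not the package code: the helper weightedEstimate, the simulated batch shifts and the Pearson-based weights are all hypothetical.

```r
set.seed(2)
n <- 300
Strata <- rep(c("batch1", "batch2", "batch3"), each = n / 3)
y <- rnorm(n, mean = 50, sd = 20)
yTRUE <- y + rnorm(n, sd = 10)
# hypothetical batch effect: each stratum shifts the marker intercepts
shift <- unname(c(batch1 = 0, batch2 = 15, batch3 = -10)[Strata])
datX <- sapply(1:4, function(j) shift + j * yTRUE + rnorm(n, sd = 15 * j))

# hypothetical helper: weighted average of markers standardized via lm(marker ~ y)
weightedEstimate <- function(datX, y) {
  standardized <- apply(datX, 2, function(marker) {
    fit <- coef(lm(marker ~ y))
    (marker - fit[1]) / fit[2]
  })
  r <- as.vector(cor(datX, y))
  w <- r^2 / (1 + r^2)
  as.vector(standardized %*% (w / sum(w)))
}

# estimate within each stratum, writing results back in the original row order
estimate <- numeric(n)
for (s in unique(Strata)) {
  rows <- Strata == s
  estimate[rows] <- weightedEstimate(datX[rows, , drop = FALSE], y[rows])
}
```

Fitting the intercepts separately per stratum means a batch-specific shift in the markers is absorbed by that stratum's own calibration rather than distorting a pooled fit.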
Cho IH, Park KS, Lim CJ (2010) An Empirical Comparative Study on Validation of Biological Age Estimation Algorithms with an Application of Work Ability Index. Mechanisms of Ageing and Development 131(2):69-78
library(WGCNA)
# observed trait
y = rnorm(1000, mean = 50, sd = 20)
# unobserved, true trait (one independent noise value per observation)
yTRUE = y + rnorm(1000, sd = 10)
# now we simulate surrogate markers around the true trait
datX = simulateModule(yTRUE, nGenes = 20, minCor = .4, maxCor = .9,
                      geneMeans = rnorm(20, 50, 30))
True1 = TrueTrait(datX = datX, y = y)
datTrue = True1$datEstimates
par(mfrow = c(2, 2))
# plot each estimate against the true trait and report the mean absolute deviation
for (i in seq_len(ncol(datTrue))) {
  meanAbsDev = mean(abs(yTRUE - datTrue[, i]))
  verboseScatterplot(datTrue[, i], yTRUE, xlab = names(datTrue)[i],
                     main = paste(i, "MeanAbsDev=", signif(meanAbsDev, 3)))
  abline(0, 1)
}
# compare the estimated standard deviation of y.true2
True1[[2]]
# with the true SD
sqrt(var(yTRUE - datTrue$y.true2))
# compare the estimated standard deviation of y.true3
True1[[3]]
# with the true SD
sqrt(var(yTRUE - datTrue$y.true3))