rho.bounds.pred: Estimates plausible values of Pearson's correlation coefficient between two variables observed in distinct samples referring to the same target population.

Description

This function evaluates the uncertainty in estimating Pearson's correlation coefficient between y.rec (Y) and z.don (Z) when the two variables are observed in two different samples that share a set of common predictors (Xs). The Xs are used to predict Y and Z, and the predictions then become the input for estimating the uncertainty.

Usage

rho.bounds.pred(data.rec, data.don,
                match.vars, y.rec, z.don,
                pred = "lm",
                w.rec = NULL, w.don = NULL,
                out.pred = FALSE, ...)

Value

a list with the following components:

up.rec returned only when out.pred = TRUE; it is data.rec with the predicted values of both Y and Z added;

up.don returned only when out.pred = TRUE; it is data.don with the predicted values of both Y and Z added;

corr the estimated correlations between Y (Z) and the corresponding predicted values;

bounds a vector with three values: the estimated lower bound for Pearson's correlation coefficient between y.rec (Y) and z.don (Z); the estimated upper bound; and the mid-point of the interval, which corresponds to the estimate of Pearson's correlation coefficient under the conditional independence assumption.

Arguments

data.rec: dataframe including the Xs (predictors, listed in match.vars) and y.rec (response; target variable in this dataset).
data.don: dataframe including the Xs (predictors, listed in match.vars) and z.don (response; target variable in this dataset).
match.vars: vector with the names of the X variables to be used as (possible) predictors of y.rec and z.don, respectively.
y.rec: character string with the name of the Y target variable in data.rec. It should be a numeric variable.
z.don: character string with the name of the Z target variable in data.don. It should be a numeric variable.
pred: string specifying the method used to obtain predictions of both Y and Z. Available methods are: pred = "lm" (default) fits two linear regression models (function lm) with Y and Z as response variables and match.vars as predictors; pred = "roblm" fits two robust linear regression models (function rlm in package MASS); pred = "lasso" uses the lasso with cross-validation (function cv.glmnet in package glmnet) to select the subset of match.vars that best predicts Y (Z) and then obtains the model predictions; pred = "rf" fits random forests (function randomForest in package randomForest) to get predictions of both Y and Z.
w.rec: name of the variable holding the weights associated with the units in data.rec, if available; the weights are used only in estimating correlations, not in fitting models.
w.don: name of the variable holding the weights associated with the units in data.don, if available; the weights are used only in estimating correlations, not in fitting models.
out.pred: Logical, when TRUE (default is FALSE) the output includes the input datasets with the predictions of both the target variables.
...: possible additional parameters, if needed.
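
The note on w.rec and w.don (weights enter the correlations but not the model fits) can be illustrated with a small base R sketch. It is only an assumption about how such a computation might look, not the package's internal code: the weight column w is invented for the example, and cov.wt is used here as one readily available weighted correlation estimator.

```r
# Illustrative sketch; the weight column `w` is hypothetical (made up here).
set.seed(1)
d   <- iris[, c("Sepal.Length", "Sepal.Width", "Petal.Length")]
d$w <- runif(nrow(d), min = 0.5, max = 2)   # invented survey weights

# The model is fitted WITHOUT the weights ...
fit      <- lm(Petal.Length ~ Sepal.Length + Sepal.Width, data = d)
d$pred.y <- fitted(fit)

# ... and the weights are used only when estimating the correlation
# between the observed target and its prediction.
wc <- cov.wt(d[, c("Petal.Length", "pred.y")], wt = d$w, cor = TRUE)$cor
w.cor  <- wc["Petal.Length", "pred.y"]        # weighted correlation
uw.cor <- cor(d$Petal.Length, d$pred.y)       # unweighted, for comparison
c(weighted = w.cor, unweighted = uw.cor)
```

cov.wt rescales the weights to sum to one internally, so raw survey weights can be passed as they are.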

Author

Marcello D'Orazio mdo.statmatch@gmail.com

Details

This function evaluates the uncertainty in estimating Pearson's correlation coefficient between y.rec (Y) and z.don (Z) when the two variables are observed in two different samples that refer to the same target population and share a set of common predictors X (match.vars). Evaluating the uncertainty amounts to estimating the lower and upper bounds of the correlation coefficient between Y and Z, given the available data. The method uses the expressions proposed by Rodgers and DeVol (1982) but, instead of using the Xs in match.vars directly, replaces them with the predictions of Y and Z provided by the models fitted according to pred. This avoids the drawbacks encountered when estimating covariances in the presence of several X variables, some of which are categorical (factors) and would have to be recoded as dummy variables. The final estimation of the bounds is carried out by the function rho.bounds. Note that the correlations between the predictions of Y and Z are estimated after pooling the samples. Survey weights, if available (arguments w.rec and w.don), are used in estimating the correlations but not in fitting the models.
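
The procedure described above can be sketched in base R for the pred = "lm" case without weights. This is a hedged illustration, not the code of rho.bounds.pred or rho.bounds: it assumes the bounds take the usual form implied by a valid correlation matrix, with mid-point r_xy' R^{-1} r_xz and half-width sqrt((1 - r_xy' R^{-1} r_xy)(1 - r_xz' R^{-1} r_xz)), where the two "X" variables are the predictions of Y and Z. All object names are chosen for the example.

```r
# Illustrative sketch only (base R); not the package implementation.
set.seed(11335577)
pos  <- sample(x = 1:150, size = 60, replace = FALSE)
ir.A <- iris[pos,  c(1:3, 5)]    # recipient: Xs + Petal.Length (Y)
ir.B <- iris[-pos, c(1:2, 4:5)]  # donor:     Xs + Petal.Width  (Z)

# Step 1: fit each target on the sample where it is observed
fit.y <- lm(Petal.Length ~ Sepal.Length + Sepal.Width + Species, data = ir.A)
fit.z <- lm(Petal.Width  ~ Sepal.Length + Sepal.Width + Species, data = ir.B)

# Step 2: predict both targets in both samples
for (nm in c("ir.A", "ir.B")) {
  d <- get(nm)
  d$pred.y <- predict(fit.y, newdata = d)
  d$pred.z <- predict(fit.z, newdata = d)
  assign(nm, d)
}

# Step 3: correlations; predictions are pooled, targets use their own sample
xs   <- c("pred.y", "pred.z")
R.xx <- cor(rbind(ir.A[, xs], ir.B[, xs]))   # between predictions (pooled)
r.xy <- cor(ir.A[, xs], ir.A$Petal.Length)   # predictions vs Y, sample A
r.xz <- cor(ir.B[, xs], ir.B$Petal.Width)    # predictions vs Z, sample B

# Step 4: bounds for cor(Y, Z); mid-point = conditional independence value
R.inv <- solve(R.xx)
mid   <- drop(t(r.xy) %*% R.inv %*% r.xz)
q.y   <- drop(t(r.xy) %*% R.inv %*% r.xy)
q.z   <- drop(t(r.xz) %*% R.inv %*% r.xz)
# guard: quantities estimated on different samples may slightly exceed 1
half  <- sqrt(max(0, 1 - q.y) * max(0, 1 - q.z))
bounds <- c(low = mid - half, up = mid + half, mid = mid)
bounds
```

The better the predictions reproduce Y and Z, the closer q.y and q.z get to 1 and the narrower the interval becomes.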

References

D'Orazio, M. (2024). Is Statistical Matching feasible? Note, https://www.researchgate.net/publication/387699016_Is_statistical_matching_feasible.

Rodgers, W.L. and DeVol, E.B. (1982). An evaluation of statistical matching. Report Submitted to the Income Survey Development Program, Dept. of Health and Human Services, Institute for Social Research, University of Michigan.

Examples

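library(StatMatch)  # package providing rho.bounds.pred

set.seed(11335577)
# split iris into two samples sharing the Xs but with different targets
pos <- sample(x = 1:150, size = 60, replace = FALSE)
ir.A <- iris[pos, c(1:3, 5)]     # recipient: Petal.Length (Y) observed
ir.B <- iris[-pos, c(1:2, 4:5)]  # donor: Petal.Width (Z) observed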

intersect(colnames(ir.A), colnames(ir.B)) # shared Xs

# predictions from linear regression models
op1 <- rho.bounds.pred(data.rec = ir.A, data.don = ir.B,
                       match.vars = c("Sepal.Length", "Sepal.Width", "Species"),
                       y.rec = "Petal.Length", z.don = "Petal.Width",
                       pred = "lm")
op1

# predictions from robust linear regression models
op2 <- rho.bounds.pred(data.rec = ir.A, data.don = ir.B,
                       match.vars = c("Sepal.Length", "Sepal.Width", "Species"),
                       y.rec = "Petal.Length", z.don = "Petal.Width",
                       pred = "roblm")
op2
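
A further call, sketched here under the assumption that the StatMatch package is installed, sets out.pred = TRUE and inspects the components listed under Value. The object name op3 is chosen for the example; the data are rebuilt so the block runs on its own.

```r
library(StatMatch)  # assumed to be installed

set.seed(11335577)
pos  <- sample(x = 1:150, size = 60, replace = FALSE)
ir.A <- iris[pos,  c(1:3, 5)]
ir.B <- iris[-pos, c(1:2, 4:5)]

# also return the input datasets with the predictions attached
op3 <- rho.bounds.pred(data.rec = ir.A, data.don = ir.B,
                       match.vars = c("Sepal.Length", "Sepal.Width", "Species"),
                       y.rec = "Petal.Length", z.don = "Petal.Width",
                       pred = "lm", out.pred = TRUE)

names(op3)        # components of the output list
op3$corr          # correlations between targets and their predictions
op3$bounds        # lower bound, upper bound, mid-point
head(op3$up.rec)  # data.rec with predicted Y and Z
```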