Fbounds.pred: Estimates Frechet bounds for the cells of the contingency table crossing two categorical variables observed in distinct samples referring to the same target population.

Description

This function assesses the uncertainty in estimating the contingency table crossing y.rec (Y) and z.don (Z) when the two variables are observed in two different samples sharing a number of common predictors.

Usage

Fbounds.pred(data.rec, data.don,
             match.vars, y.rec, z.don, pred = "multinom",
             w.rec = NULL, w.don = NULL, type.pred = "random",
             out.pred = FALSE, ...)

Value

a list with the following components:

up.rec returned only when out.pred = TRUE; a reduced version of data.rec containing the estimated conditional probabilities for both Y and Z (depending on the pred argument), the predicted class of Y and the predicted class of Z (both depending on the type.pred argument), the observed class of Y, the predictors (argument match.vars) and, when w.rec is specified, the weights.

up.don returned only when out.pred = TRUE; a reduced version of data.don containing the estimated conditional probabilities for both Y and Z (depending on the pred argument), the predicted class of Y and the predicted class of Z (both depending on the type.pred argument), the observed class of Z, the predictors (argument match.vars) and, when w.don is specified, the weights.

p.xx.ini the estimated relative frequencies in the table crossing the predictions of Y and Z; it is estimated after pooling the samples (weighted average of the estimates obtained on the separate samples);

p.xy.ini the estimated table crossing Y and the predictions of both Y and Z estimated from data.rec (weights are used if provided with the w.rec argument);

p.xz.ini the estimated table crossing Z and the predictions of both Y and Z estimated from data.don (weights are used if provided with the w.don argument);

accuracy the estimated accuracy in predicting Y and Z, respectively, with the chosen method (argument pred) and the available predictors (argument match.vars);

bounds a data.frame whose columns report the estimated unconditional and conditional bounds for each cell in the contingency table crossing y.rec (Y) and z.don (Z);

uncertainty the uncertainty associated with the input data, measured as the average width of the uncertainty bounds with and without conditioning on the predictions (for further details see Frechet.bounds.cat).
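
To make the uncertainty component more concrete, the following is a minimal sketch of an "average width" summary computed from a table of bounds. The toy data.frame and its column names (low.u, up.u, low.cx, up.cx) are assumptions made for illustration only, and the plain unweighted mean over cells is likewise an assumption; refer to Frechet.bounds.cat for the actual definition.

```r
# Toy bounds for a 2 x 2 table crossing Y (a, b) and Z (u, v).
# Column names are hypothetical and chosen only for this sketch.
bounds <- data.frame(
  Y      = c("a", "a", "b", "b"),
  Z      = c("u", "v", "u", "v"),
  low.u  = c(0.10, 0.10, 0.00, 0.00),  # unconditional lower bounds
  up.u   = c(0.50, 0.50, 0.40, 0.40),  # unconditional upper bounds
  low.cx = c(0.35, 0.10, 0.00, 0.25),  # conditional lower bounds
  up.cx  = c(0.50, 0.25, 0.15, 0.40)   # conditional upper bounds
)

# Average interval width, without and with conditioning
av.u  <- mean(bounds$up.u  - bounds$low.u)
av.cx <- mean(bounds$up.cx - bounds$low.cx)

c(unconditional = av.u, conditional = av.cx)
```

A smaller conditional average than the unconditional one indicates that conditioning has narrowed the intervals.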

Arguments

data.rec: dataframe including the Xs (predictors, listed in match.vars) and y.rec (response; target variable in this dataset)
data.don: dataframe including the Xs (predictors, listed in match.vars) and z.don (response; target variable in this dataset)
match.vars: vector with the names of the X variables to be used as predictors of y.rec and z.don, respectively (or the set from which the best predictors are selected when pred = "lasso")
y.rec: character indicating the name of Y target variable in data.rec. It should be a factor.
z.don: character indicating the name of Z target variable in data.don. It should be a factor.
pred: character specifying the method used to obtain predictions of both Y and Z. Available methods are: pred = "multinom" (default) fits two multinomial models (R package nnet, function multinom) with Y and Z as response variables and match.vars as predictors; pred = "lasso" uses the lasso method with cross-validation (R package glmnet, function cv.glmnet) to select the subsets of match.vars that best predict Y and Z, respectively, and then fits the multinomial models with the selected predictors; pred = "nb" uses the Naive Bayes classifier (R package naivebayes, function naive_bayes) to get predictions of Y and Z, respectively; pred = "rf" fits random forests (R package randomForest, function randomForest) to get predictions of both Y and Z.
w.rec: name of the variable with the weights of the units in data.rec, if available (default is NULL); the weights, if available, are only used for estimating bounds, not for fitting models.
w.don: name of the variable with the weights of the units in data.don, if available (default is NULL); the weights, if available, are only used for estimating bounds, not for fitting models.
type.pred: string specifying how to obtain the predictions of Y and Z. The fitted models return conditional probabilities (or scores). If type.pred = "random" (default), the predicted class of Y (Z) is obtained by a random draw with selection probabilities equal to the estimated conditional probabilities (scores); if instead type.pred = "mostvoted", the predicted class is the one with the highest estimated conditional probability (score).
out.pred: Logical. If TRUE (default is FALSE) returns the input datasets with the estimated conditional probabilities (depending on pred argument), the predicted class for the target variable (Y or Z) in the dataset (depending on type.pred argument) and the true observed class of Y (or Z).
...: additional arguments, if needed.
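
The two options for type.pred can be illustrated with a short base R sketch. The probability matrix below is made up for illustration and the object names are hypothetical; this shows the general idea of the two rules, not the internal code of Fbounds.pred.

```r
set.seed(1)  # the random draw below depends on the seed

# Hypothetical conditional probabilities for 4 units and 3 classes
probs <- matrix(c(0.7, 0.2, 0.1,
                  0.1, 0.6, 0.3,
                  0.3, 0.3, 0.4,
                  0.2, 0.5, 0.3),
                ncol = 3, byrow = TRUE,
                dimnames = list(NULL, c("a", "b", "c")))
classes <- colnames(probs)

# "mostvoted" rule: the class with the highest probability in each row
pred.mv <- classes[max.col(probs, ties.method = "first")]

# "random" rule: one draw per row, using that row's probabilities
pred.rnd <- apply(probs, 1,
                  function(p) sample(classes, size = 1, prob = p))

pred.mv
pred.rnd
```

The "mostvoted" rule is deterministic, while the "random" rule gives results that vary with the seed, so setting a seed is advisable when reproducibility matters.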

Author

Marcello D'Orazio mdo.statmatch@gmail.com

Details

The function evaluates the uncertainty in estimating the contingency table crossing y.rec (Y) and z.don (Z) when the two variables are observed in two different samples related to the same target population, but the samples share a number of common predictors. The evaluation of the uncertainty is equivalent to estimating the bounds for each cell in the contingency table where Y and Z intersect; the bounds can be unconditional (Frechet property) or conditional on the predictions of both Y and Z provided by the models fitted according to the pred argument. This latter way of working avoids many of the drawbacks of obtaining expectations of conditional bounds when conditioning on many X variables, and allows the inclusion of non-categorical predictors. The final estimation of the conditional bounds is provided by the function Frechet.bounds.cat.
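
As a rough illustration of the difference between unconditional and conditional bounds, the sketch below computes both for a 2 x 2 table in base R. All inputs are invented: d1 and d2 stand for two hypothetical conditioning classes (in Fbounds.pred these would be derived from the predictions of Y and Z), and the helper frechet() is defined here only for the example. It is not the code used by Frechet.bounds.cat.

```r
# Frechet bounds for every cell, given the two marginal distributions:
# max(0, p_y + p_z - 1) <= p_yz <= min(p_y, p_z)
frechet <- function(p.y, p.z) {
  list(low = pmax(outer(p.y, p.z, "+") - 1, 0),
       up  = outer(p.y, p.z, pmin))
}

# Hypothetical inputs: distribution of the conditioning classes and
# the distributions of Y and Z within each class
p.d   <- c(d1 = 0.5, d2 = 0.5)
p.y.d <- rbind(d1 = c(a = 0.9, b = 0.1), d2 = c(a = 0.3, b = 0.7))
p.z.d <- rbind(d1 = c(u = 0.8, v = 0.2), d2 = c(u = 0.2, v = 0.8))

# Marginal distributions of Y and Z implied by the inputs
p.y <- colSums(p.d * p.y.d)
p.z <- colSums(p.d * p.z.d)

# Unconditional bounds
unc   <- frechet(p.y, p.z)
low.u <- unc$low
up.u  <- unc$up

# Conditional bounds: bounds within each class, averaged with weights p.d
low.c <- up.c <- matrix(0, length(p.y), length(p.z),
                        dimnames = list(names(p.y), names(p.z)))
for (d in names(p.d)) {
  b     <- frechet(p.y.d[d, ], p.z.d[d, ])
  low.c <- low.c + p.d[[d]] * b$low
  up.c  <- up.c  + p.d[[d]] * b$up
}

low.u; up.u   # unconditional interval for each cell
low.c; up.c   # conditional interval, never wider than the unconditional one
```

The better the conditioning classes predict Y and Z, the more the conditional intervals shrink relative to the unconditional ones.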

References

D'Orazio, M. (2024). Is Statistical Matching feasible? Note, https://www.researchgate.net/publication/387699016_Is_statistical_matching_feasible.

Examples

data(quine, package="MASS") #loads quine from MASS
str(quine)

# split quine into two subsets (rows for A are drawn with replacement,
# so quine.A can contain repeated rows)
set.seed(223344)
lab.A <- sample(nrow(quine), 70, replace=TRUE)
quine.A <- quine[lab.A, 1:3]
quine.B <- quine[-lab.A, 2:4]

# multinomial model and predictions with most-voted criterion
fbp <- Fbounds.pred(data.rec = quine.A, data.don = quine.B, 
                    match.vars = c("Sex", "Age"), 
                    y.rec = "Eth", z.don = "Lrn", 
                    pred = "multinom", type.pred = "mostvoted")

fbp$p.xx.ini # estimated cross-tab of predictions
fbp$bounds # estimated conditional and unconditional bounds
fbp$uncertainty  # estimated uncertainty about Y*Z

# multinomial model and predictions with randomized criterion
fbp <- Fbounds.pred(data.rec = quine.A, data.don = quine.B, 
                    match.vars = c("Sex", "Age"), 
                    y.rec = "Eth", z.don = "Lrn", 
                    pred = "multinom", type.pred = "random")

fbp$p.xx.ini # estimated cross-tab of predictions
fbp$bounds # estimated conditional and unconditional bounds
fbp$uncertainty  # estimated uncertainty about Y*Z
