Returns data or a plot showing the difference of two or more data frames The differences are calculated on the base of differing metrics, chosen in the funct argument. All used data frames must contain at least one column named equal in all data frames, that has equal values.
uni_compare(
dfs,
benchmarks,
variables = NULL,
nboots = 2000,
n_bench = NULL,
boot_all = FALSE,
funct = "rel_mean",
data = TRUE,
type = "comparison",
legendlabels = NULL,
legendtitle = NULL,
colors = NULL,
shapes = NULL,
summetric = "rmse2",
label_x = NULL,
label_y = NULL,
plot_title = NULL,
varlabels = NULL,
name_dfs = NULL,
name_benchmarks = NULL,
summet_size = 4,
silence = TRUE,
conf_level = 0.95,
conf_adjustment = NULL,
percentile_ci = TRUE,
weight = NULL,
id = NULL,
strata = NULL,
weight_bench = NULL,
id_bench = NULL,
strata_bench = NULL,
adjustment_weighting = "raking",
adjustment_vars = NULL,
raking_targets = NULL,
post_targets = NULL,
ndigits = 3,
parallel = FALSE
)
A plot based on ggplot2::ggplot2()
(or data frame if data==TRUE)
which shows the difference between two or more data frames on predetermined variables,
named identical in both data frames.
A character vector containing the names of data frames to compare against the benchmarks.
A character vector containing the names of benchmarks to compare the data frames against.
The vector must either be the same length as dfs
, or length 1. If it has length 1 every
df will be compared against the same benchmark. Benchmarks can either be the name of data frames,
the name of a list of tables, or a named vector of means. The tables in the list need to be named as the respective variables
in the data frame of comparison. When they are a named vector of means, the means need to be named as the respective variables
in the dfs.
A character vector containing the names of the variables for the comparison. If NULL,
all variables named similarly in both the dfs
and the benchmarks will be compared. Variables missing
in one of the data frames or the benchmarks will be neglected for this comparison.
The number of bootstraps used to calculate standard errors. Must either be >2 or 0.
If >2 bootstrapping is used to calculate standard errors with nboots
iterations. If 0, SE
is calculated analytically. We do not recommend using nboots
=0 because this method is not
yet suitable for every funct
used and every method. Depending on the size of the data and the
number of bootstraps, uni_compare
can take a while.
A list of vectors containing the number of cases for every variable in the benchmark. This is only needed, if the benchmark is given as a vector. The list should be as long as the number of dataframes
If TURE, both, dfs and benchmarks will be bootstrapped. Otherwise the benchmark estimate is assumed to be constant.
A character string, indicating the function to calculate the difference between the data frames.
Predefined functions are:
"d_mean"
, "ad_mean"
A function to calculate the (absolute) difference in mean of
the variables in dfs
and benchmarks with the same name. Only applicable for
metric variables.
"d_prop"
, "ad_prop"
A function to calculate the (absolute) difference in proportions of
the variables in dfs
and benchmarks with the same name. Only applicable for dummy
variables.
"rel_mean"
, "abs_rel_mean"
A function to calculate the (absolute)
relative difference in mean of the variables in dfs
and benchmarks with the same name.
#' For more information on the formula for difference and analytic variance, see Felderer
et al. (2019). Only applicable for metric variables.
"rel_prop"
, "abs_rel_prop"
A function to calculate the (absolute)
relative difference in proportions of the variables in dfs
and benchmarks with
the same name. It is calculated similar to the relative difference in mean
(see Felderer et al., 2019), however the default label for the plot is different.
Only applicable for dummy variables.
If TRUE, a uni_compare_object is returned, containing results of the comparison.
Define the type of comparison. Can either be "comparison"
or "nonresponse"
.
A character string or vector of strings containing a label for the legend.
A character string containing the title of the legend.
A vector of colors, that is used in the plot for the different comparisons.
A vector of shapes applicable in ggplot2::ggplot2()
, that is
used in the plot for the different comparisons.
If "avg1"
, "mse1"
, "rmse1"
, or "R"
the respective measure is calculated for the biases of each survey. The values
"mse1"
and "rmse1"
lead to similar results as in "mse2"
and "rmse2"
,
with slightly different visualization in the plot. If summetric = NULL
, no summetric
will be displayed in the Plot. When "R"
is chosen, also response_identificator
is needed.
A character string or vector of character strings containing a label for the x-axis and y-axis.
A character string containing the title of the plot.
A character string or vector of character strings containing the new names of variables, also used in plot.
A character string or vector of character strings containing the
new names of the dfs
and benchmarks
, that is also used in plot.
A number to determine the size of the displayed summetric
in the plot.
If silence = FALSE
a warning will be displayed, if variables a
re excluded from either the data frame or benchmark, for not existing in both.
A numeric value between zero and one to determine the confidence level of the confidence interval.
If conf_adjustment = TRUE
the confidence level of the confidence interval will be
adjusted with a Bonferroni adjustment, to account for the problem of multiple comparisons.
If TURE, cofidence intervals will be calculated using the percentile method. If False, they will be calculated using the normal method.
A character vector determining variables to weight the dfs
or
benchmarks
. They have to be part of the respective data frame. If only one character is provided,
the same variable is used to weigh every df
or benchmark
. If a
weight variable is provided also an id
variable is needed.For
weighting, the survey
package is used.
A character vector determining id
variables used
to weigh the dfs
or benchmarks
with the help of the
survey
package. They have to be part of the respective data frame. If
only one character is provided, the same variable is used to weigh every
df
or benchmark
.
A character vector determining strata variables
used to weigh the dfs
or benchmarks
with the help of the
survey
package.They have to be part of the respective data frame.
If only one character is provided, the same variable is used to weight every
df
or benchmark
.
A character vector indicating if adjustment
weighting should be used. It can either be "raking"
or "post_start"
.
Variables used to adjust the survey when using raking or post stratification.
A list of raking targets that can be given to the rake
function of rake
, to rake the dfs
.
A list of post-stratification targets that can be given to the
postStratify
function, to post-stratify the dfs
.
The number of digits to round the numbers in the plot.
Can be either FALSE
or a number of cores that should
be used in the function. If it is FALSE
, only one core will be used and
otherwise the given number of cores will be used.
Felderer, B., Kirchner, A., & Kreuter, FALSE. (2019). The Effect of Survey Mode on Data Quality: Disentangling Nonresponse and Measurement Error Bias. Journal of Official Statistics, 35(1), 93–115. https://doi.org/10.2478/jos-2019-0005
## Get Data for comparison
data("card")
north<-card[card$south==0,]
white<-card[card$black==0,]
## use the function to plot the data
univar_comp<-sampcompR::uni_compare(dfs = c("north","white"),
benchmarks = c("card","card"),
variables= c("age","educ","fatheduc","motheduc","wage","IQ"),
funct = "abs_rel_mean",
nboots=200,
summetric="rmse2",
data=FALSE)
univar_comp
Run the code above in your browser using DataLab