Produces tables from observed and synthesised data and calculates utility measures to compare them with their expectation if the synthesising model is correct.
It can also be used with synthetic data not created by syn(), but then an additional parameter cont.na might need to be provided.
# S3 method for synds
utility.tab(object, data, vars = NULL, ngroups = 5,
useNA = TRUE, max.table = 1e6,
print.tables = length(vars) < 4,
print.stats = c("pMSE", "S_pMSE", "df"),
print.zdiff = FALSE, print.flag = TRUE,
digits = 4, k.syn = FALSE, ...)
# S3 method for data.frame
utility.tab(object, data, vars = NULL, cont.na = NULL,
ngroups = 5, useNA = TRUE, max.table = 1e6,
print.tables = length(vars) < 4,
print.stats = c("pMSE", "S_pMSE", "df"),
print.zdiff = FALSE, print.flag = TRUE,
digits = 4, k.syn = FALSE,
compare.synorig = TRUE, ...)
# S3 method for list
utility.tab(object, data, vars = NULL, cont.na = NULL,
ngroups = 5, useNA = TRUE, max.table = 1e6,
print.tables = length(vars) < 4,
print.stats = c("pMSE", "S_pMSE", "df"),
print.zdiff = FALSE, print.flag = TRUE,
digits = 4, k.syn = FALSE,
compare.synorig = TRUE, ...)
# S3 method for utility.tab
print(x, print.tables = NULL,
print.zdiff = NULL, print.stats = NULL,
digits = NULL, ...)
An object of class utility.tab, which is a list with the following components:
m: number of synthetic data sets in object, i.e. object$m.
VW: a vector with object$m values for the Voas Williamson utility measure; linearly related to pMSE.
FT: a vector with object$m values for the Freeman-Tukey utility measure.
JSD: a vector with object$m values for the Jensen-Shannon divergence for comparing the tables.
SPECKS: a vector with object$m values for the Kolmogorov-Smirnov statistic for comparing the propensity scores for the original and synthetic data.
WMabsDD: a vector with object$m values of the weighted mean absolute difference in distributions for original and synthetic data.
U: a vector with object$m values of the Wilcoxon statistic comparing the propensity scores for the original and synthetic data.
G: a vector with object$m values for the adjusted likelihood ratio utility measure.
pMSE: a vector with object$m values of the propensity score mean-squared error; linearly related to VW.
PO50: a vector with object$m values of the percentage over 50% of observations correctly predicted from the propensity scores; linearly related to SPECKS and MabsDD.
MabsDD: a vector with object$m values of the mean absolute difference in distributions for original and synthetic data; linearly related to SPECKS and PO50.
dBhatt: a vector with object$m values of the Bhattacharyya distances between the synthetic and original data; linearly related to the square root of FT.
S_VW: VW/df.
S_FT: FT/df.
S_JSD: JSD/df.
S_WMabsDD: WMabsDD/df.
S_G: G/df.
S_pMSE: standardised measure from pMSE, identical to S_VW.
df: a vector of degrees of freedom for the chi-square tests, equal to the number of cells in the tables with any observed or synthesised counts minus one when k.syn == FALSE, or equal to the number of cells when k.syn == TRUE.
dfG: degrees of freedom used in standardising G.
nempty: a vector of length object$m with the number of cells not contributing to the statistics.
tab.obs: a table from the observed data.
tab.syn: a table or a list of m tables from the synthetic data.
tab.zdiff: a table or a list of m tables of Z statistics for differences between observed and synthesised cells of the tables. Large absolute values indicate a large contribution to lack-of-fit.
digits: an integer indicating the number of decimal places for printing statistics, tab.zdiff and mean results for m > 1.
print.tables: a logical value that determines if tables of observed and synthesised data are to be printed.
print.stats: a single string or a vector of strings with utility measures to be printed out.
print.zdiff: a logical value that determines if tables of Z scores for differences between observed and expected are to be printed.
n: number of observations in the original data set.
k.syn: a logical indicator as to whether the sample size itself has been synthesised.
object: an object of class synds, which stands for 'synthesised data set'. It is typically created by function syn() or syn.strata() and it includes object$m, the number of synthesised data set(s), as well as object$syn, the synthesised data set if m = 1, or a list of m such data sets. Alternatively, when data are synthesised not using syn(), it can be a data frame with a synthetic data set or a list of data frames with synthetic data sets, all created from the same original data with the same variables and the same method.
data: the original (observed) data set.
vars: a single string or a vector of strings with the names of variables to be used to form the table.
cont.na: a named list of codes for missing values for continuous variables if different from the R missing data code NA. The names of the list elements must correspond to the names of the variables for which the missing data codes need to be specified.
max.table: a maximum table size. You could try increasing the default value, but memory problems are likely.
ngroups: if numerical (non-factor) variables are included, they will be classified into this number of groups to form tables. Classification is performed using the classIntervals() function with n = ngroups. By default, style = "quantile" is used to get appropriate groups for skewed data. Problems for variables with a small number of unique values are handled by selecting only unique values of breaks. Other arguments of classIntervals() may, however, be specified in the call to utility.tab().
useNA: a logical value that determines if NA values are to be included in tables.
print.tables: a logical value that determines if tables of observed and synthesised data are to be printed. By default tables are printed if they have up to three dimensions.
print.stats: a single string or a vector of strings that determines which utility measures to print. Must be a selection from: "VW", "FT", "JSD", "SPECKS", "WMabsDD", "U", "G", "pMSE", "PO50", "MabsDD", "dBhatt", "S_VW", "S_FT", "S_JSD", "S_WMabsDD", "S_G", "S_pMSE", "df", "dfG". If print.stats = "all", all of these will be printed. For more information see the details section below.
print.zdiff: a logical value that determines if tables of Z scores for differences between observed and expected are to be printed.
print.flag: a logical value that determines if messages are to be printed during computation.
digits: an integer indicating the number of decimal places for printing statistics, tab.zdiff and mean results for m > 1.
k.syn: a logical indicator as to whether the sample size itself has been synthesised. The default value is FALSE, which will apply to synthetic data created by synthpop.
compare.synorig: a logical value that determines whether the function synorig.compare() is used to check that the data sets can be compared. It applies when the synthetic data are supplied as a data frame or a list, and defaults to TRUE.
...: additional parameters; can be passed to the classIntervals() function.
x: an object of class utility.tab.
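The grouping described for ngroups can be illustrated with a minimal base R sketch. It uses quantile() and cut() as stand-ins for classIntervals(), so it shows the idea of quantile breaks reduced to their unique values rather than the code utility.tab() actually runs. The vector x is made-up data.

```r
# Made-up, skewed variable with few unique values.
x <- c(0, 0, 0, 0, 1, 1, 2, 3, 5, 8)
ngroups <- 5

# Quantile break points for ngroups groups (base R stand-in for classIntervals()).
breaks <- quantile(x, probs = seq(0, 1, length.out = ngroups + 1), names = FALSE)

# Repeated break points would make cut() fail, so keep only unique values.
breaks <- unique(breaks)

# Classify the values; fewer than ngroups groups may result.
groups <- cut(x, breaks = breaks, include.lowest = TRUE)
table(groups)
```

With heavily tied data like this, the number of groups obtained can be smaller than the ngroups requested.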
Forms tables of observed and synthesised values for the variables specified in vars. Several utility measures are calculated from the cells of the tables, as described below. Details of all of these measures can be found in Raab et al. (2021). If the synthesising model is correct, the measures VW, FT, G and JSD should have chi-square distributions with df degrees of freedom for large samples. Standardised versions of each measure are available (e.g. S_VW for VW, where S_VW = VW/df) that will have an expected value of 1 if the synthesising model is correct.
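As a rough illustration of how a chi-square-type measure and its standardised version relate, the sketch below computes a Freeman-Tukey statistic in its usual textbook form from two sets of cell counts with equal totals. The counts are invented, and the exact formula used inside utility.tab() is an assumption here; see Raab et al. (2021) for the definitions.

```r
# Invented cell counts for the same cross-classification (equal totals).
obs <- c(120, 80, 45, 55)   # original data
syn <- c(110, 90, 50, 50)   # synthetic data

# Freeman-Tukey statistic in its usual form (assumed, not taken from synthpop).
ft <- 4 * sum((sqrt(obs) - sqrt(syn))^2)

# Cells with any observed or synthesised counts, minus one (the k.syn == FALSE case).
df <- sum(obs + syn > 0) - 1

# Standardised measure: values near 1 are what a correct synthesising model
# would be expected to give on average.
s_ft <- ft / df
s_ft
```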
Four other measures are calculated by considering the table as a prediction model: the propensity score mean-squared error pMSE and, from a comparison of propensity scores for the synthetic and original data, the Kolmogorov-Smirnov statistic SPECKS, the Wilcoxon rank-sum statistic U, and PO50, the percentage over 50% of observations correctly predicted in the combined tables, where an observation counts as correctly predicted if its category (real or synthetic) agrees with the majority of observations in its grouping. The first of these, pMSE, is identical to VW except for a constant. No expected values are computed for the last three of these measures, but they can be obtained by replication from utility.gen().
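The prediction-model view can be sketched in base R by treating each cell's share of synthetic records as its propensity score. The counts are invented, and the formulas follow the general definitions of these measures rather than the synthpop source, so scaling details may differ from what utility.tab() reports.

```r
# Invented cell counts for the same cross-classification.
obs <- c(120, 80, 45, 55)   # original data
syn <- c(110, 90, 50, 50)   # synthetic data

n_cell <- obs + syn
N      <- sum(n_cell)
c_syn  <- sum(syn) / N       # overall share of synthetic records
p      <- syn / n_cell       # propensity score of every record in a cell

# pMSE: mean squared deviation of the propensity scores from c_syn over all records.
pmse <- sum(n_cell * (p - c_syn)^2) / N

# PO50: percentage of records in the majority category of their cell, minus 50.
po50 <- 100 * sum(pmax(obs, syn)) / N - 50

# SPECKS: largest gap between the empirical distribution functions of the
# propensity scores of original and synthetic records (assumes distinct p per cell).
ord    <- order(p)
specks <- max(abs(cumsum(obs[ord]) / sum(obs) - cumsum(syn[ord]) / sum(syn)))

c(pMSE = pmse, PO50 = po50, SPECKS = specks)
```

All three are zero when the two tables are identical and grow as the synthetic data become easier to tell apart from the original.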
Three further measures are calculated from the tables. The first two are mean absolute differences in distributions: MabsDD, the average absolute difference in the proportions of original and synthetic data over all the cells in the table, and WMabsDD, a weighted version of this measure where the weights are proportional to the inverse of the variance of the absolute differences, so that it can be standardised by its expected value, df. The third is the Bhattacharyya distance dBhatt, derived from the overlap of the histograms of the original and synthetic data sets.
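The histogram-overlap idea can be sketched as follows. The counts are invented; MabsDD is computed by reading the description literally as a mean over cells, and the distance sqrt(1 - BC) is one overlap-based distance that is exactly proportional to the square root of a Freeman-Tukey statistic when the two tables have equal totals. Whether synthpop scales MabsDD and dBhatt in exactly this way is an assumption.

```r
# Invented cell counts with equal totals.
obs <- c(120, 80, 45, 55)   # original data
syn <- c(110, 90, 50, 50)   # synthetic data

p_obs <- obs / sum(obs)
p_syn <- syn / sum(syn)

# Mean absolute difference in cell proportions (literal reading of the text).
mabsdd <- mean(abs(p_obs - p_syn))

# Bhattacharyya coefficient: overlap of the two histograms (1 = identical).
bc <- sum(sqrt(p_obs * p_syn))

# Overlap-based distance (assumed form, not taken from the synthpop source).
d_overlap <- sqrt(1 - bc)

# Freeman-Tukey statistic in its usual form, for comparison.
ft <- 4 * sum((sqrt(obs) - sqrt(syn))^2)
n  <- sum(obs)

# With equal totals, sqrt(FT) equals sqrt(8 * n) times the overlap distance.
all.equal(sqrt(ft), sqrt(8 * n) * d_overlap)
```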
Nowok, B., Raab, G.M. and Dibben, C. (2016). synthpop: Bespoke creation of synthetic data in R. Journal of Statistical Software, 74(11), 1-26. doi:10.18637/jss.v074.i11.
Raab, G.M., Nowok, B. and Dibben, C. (2021). Assessing, visualizing and improving the utility of synthetic data. Available from https://arxiv.org/abs/2109.12717.
Read, T.R.C. and Cressie, N.A.C. (1988). Goodness-of-Fit Statistics for Discrete Multivariate Data. Springer-Verlag, New York.
Voas, D. and Williamson, P. (2001) Evaluating goodness-of-fit measures for synthetic microdata. Geographical and Environmental Modelling, 5(2), 177-200.
utility.gen
library(synthpop)  # provides SD2011, syn() and utility.tab()

ods <- SD2011[1:1000, c("sex", "age", "marital", "nofriend")]

### ten syntheses; print every utility measure for a two-way table
s1 <- syn(ods, m = 10, cont.na = list(nofriend = -8))
utility.tab(s1, ods, vars = c("marital", "sex"), print.stats = "all")

### single synthesis; age is grouped into three classes
s2 <- syn(ods, m = 1, cont.na = list(nofriend = -8))
u2 <- utility.tab(s2, ods, vars = c("marital", "age", "sex"), ngroups = 3)
print(u2, print.tables = TRUE, print.zdiff = TRUE)

### synthetic data provided as 'data.frame'
utility.tab(s2$syn, ods, vars = c("marital", "nofriend"), ngroups = 3,
            print.tables = TRUE, cont.na = list(nofriend = -8), digits = 4)