synthpop (version 1.9-0)

utility.tables: Tables and plots of utility measures

Description

Calculates and plots tables of utility measures. The calculations of utility measures are done by the function utility.tab. Options are all one-way tables, all two-way tables or three-way tables for a specified third variable along with pairs of all other variables.

This function can also be used with synthetic data NOT created by syn(), but then the additional parameters not.synthesised and cont.na might need to be provided.

Usage

# S3 method for synds
utility.tables(object, data,
               tables = "twoway", maxtables = 5e4,
               vars = NULL, third.var = NULL,
               useNA = TRUE, ngroups = 5,
               tab.stats = c("pMSE", "S_pMSE", "df"), 
               plot.stat = "S_pMSE", plot = TRUE, max.table = 1e07,
               print.tabs = FALSE, digits.tabs = 4,
               max.scale = NULL, min.scale = 0, plot.title = NULL,
               nworst = 5, ntabstoprint = 0, k.syn = FALSE, 
               low = "grey92", high = "#E41A1C",
               n.breaks = NULL, breaks = NULL, print.flag = TRUE, ...)
               
# S3 method for data.frame
utility.tables(object, data, 
               cont.na = NULL, not.synthesised = NULL, 
               tables = "twoway", maxtables = 5e4,
               vars = NULL, third.var = NULL, 
               useNA = TRUE, ngroups = 5, 
               tab.stats = c("pMSE", "S_pMSE", "df"), 
               plot.stat = "S_pMSE", plot = TRUE, max.table = 1e07, 
               print.tabs = FALSE, digits.tabs = 4,
               max.scale = NULL, min.scale = 0, plot.title = NULL,  
               nworst = 5, ntabstoprint = 0, k.syn = FALSE,
               low = "grey92", high = "#E41A1C",
               n.breaks = NULL, breaks = NULL, 
               compare.synorig = TRUE, print.flag = TRUE,...)

# S3 method for list
utility.tables(object, data,
               cont.na = NULL, not.synthesised = NULL,
               tables = "twoway", maxtables = 5e4,
               vars = NULL, third.var = NULL,
               useNA = TRUE, ngroups = 5,
               tab.stats = c("pMSE", "S_pMSE", "df"),
               plot.stat = "S_pMSE", plot = TRUE, max.table = 1e07,
               print.tabs = FALSE, digits.tabs = 4,
               max.scale = NULL, min.scale = 0, plot.title = NULL,
               nworst = 5, ntabstoprint = 0, k.syn = FALSE,
               low = "grey92", high = "#E41A1C",
               n.breaks = NULL, breaks = NULL,
               compare.synorig = TRUE, print.flag = TRUE, ...)

# S3 method for utility.tables
print(x, print.tabs = NULL, digits.tabs = NULL,
      plot = NULL, plot.title = NULL,
      max.scale = NULL, min.scale = NULL,
      nworst = NULL, ntabstoprint = NULL, ...)

Value

An object of class utility.tables, which is a list with the following components:

tabs

a table with all the selected measures for all combinations of variables defined by tables, third.var, and vars.

plot.stat

measure used in mat and toplot.

tables

see above.

third.var

see above.

utility.plot

plot of the selected utility measure.

var.scores

an average of utility scores for all combinations with other variables.

plot

see above.

print.tabs

see above.

digits.tabs

see above.

plot.title

see above.

max.scale

see above.

min.scale

see above.

ntabstoprint

see above.

nworst

see above.

worstn

variable combinations with nworst worst utility scores.

worsttabs

observed and synthetic cross-tabulations for worstn.
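
As an illustration of how the returned list might be inspected, the following sketch synthesises a small subset of SD2011 and looks at a few of the components described above. The subset, the seed and the object names ods, s1 and u are arbitrary choices made for this example, and the exact layout of each component may differ between package versions.

```r
library(synthpop)

# Small illustrative subset of the SD2011 data shipped with synthpop
ods <- SD2011[1:500, c("sex", "age", "edu", "marital")]
s1  <- syn(ods, seed = 2024, print.flag = FALSE)

# Compute two-way utility measures without plotting or progress messages
u <- utility.tables(s1, ods, tables = "twoway",
                    plot = FALSE, print.flag = FALSE)

names(u)        # components of the returned list
head(u$tabs)    # selected measures for each pair of variables
u$worstn        # variable combinations with the worst utility scores
u$var.scores    # average utility score per variable
```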

Arguments

object

an object of class synds, which stands for 'synthesised data set'. It is typically created by function syn() and it includes object$m synthesised data set(s) as object$syn. This is a single data set when object$m = 1 or a list of length object$m when object$m > 1. Alternatively, when data are synthesised not using syn(), it can be a data frame with a synthetic data set or a list of data frames with synthetic data sets, all created from the same original data with the same variables and the same method.

data

the original (observed) data set.

cont.na

a named list of codes for missing values for continuous variables if different from the R missing data code NA. The names of the list elements must correspond to the variables names for which the missing data codes need to be specified.

not.synthesised

a vector of variable names for any variables that have been left unchanged in the synthetic data.
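
A minimal sketch of supplying cont.na when the synthetic data are passed as a plain data frame. It assumes the SD2011 convention that income uses -8 as its missing-value code. The subset, the seed and the object names are arbitrary choices for this example. not.synthesised would be passed in the same call, as a character vector of variable names, if some variables had been copied unchanged from the original data.

```r
library(synthpop)

# In SD2011, income codes missing values as -8 rather than NA
ods <- SD2011[1:500, c("sex", "age", "edu", "income")]
s1  <- syn(ods, cont.na = list(income = -8), seed = 1234,
           print.flag = FALSE)

# A bare data frame carries no record of the missing-value codes,
# so the same named list is passed again to utility.tables()
u_df <- utility.tables(s1$syn, ods,
                       cont.na = list(income = -8),
                       tables = "twoway",
                       plot = FALSE, print.flag = FALSE)
head(u_df$tabs)
```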

tables

defines the type of tables to produce. Options are "oneway", "twoway" (default) or "threeway". If set to "oneway" or "twoway" all possible tables from vars are produced. For "threeway", third.var may be specified and all three-way tables between this variable and other pairs of variables are produced. If a third variable is not specified the function chooses the variable with the largest median utility measure for all three-way tables it contributes to.
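
To get a feel for how many tables each option implies, the combinations can be enumerated in base R. This is only a counting sketch for a hypothetical set of six variables; it is not the code the package uses internally.

```r
# Hypothetical variable set; any character vector of names would do
vars <- c("sex", "age", "edu", "marital", "region", "income")

one_way <- combn(vars, 1)   # one column per one-way table
two_way <- combn(vars, 2)   # one column per pair of variables

# Three-way tables for a fixed third variable: that variable is
# combined with every pair formed from the remaining variables
third.var <- "sex"
three_way <- rbind(third.var, combn(setdiff(vars, third.var), 2))

ncol(one_way)
ncol(two_way)
ncol(three_way)
```

Because the number of pairs grows quadratically with the number of variables, maxtables and vars are the practical ways to keep large problems manageable.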

maxtables

maximum number of tables that will be produced. If the number of tables is larger, then utility is only measured for a sample of size maxtables. You cannot produce plots of two-way or three-way tables from sampled tables.

vars

a vector of strings with the names of variables to be used to form the table, or a vector of variable numbers in the original data. Defaults to all variables in both original and synthetic data.

third.var

when tables is "threeway", the variable to use as the third variable in combination with all pairs of the other variables.

useNA

determines if NA values are to be included in tables. Only applies for method "tab".

ngroups

if numerical (non-factor) variables are included with method = "tab", they will be classified into this number of groups to form tables. Classification is performed using the classIntervals() function with n = ngroups. By default, style = "quantile", to get appropriate groups for skewed data. Problems for variables with a small number of unique values are handled by selecting only unique values of breaks. Other arguments of classIntervals() may, however, be specified in the call to utility.tables().
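
The effect of keeping only unique breaks can be sketched in base R. The example below uses quantile() and cut() as a stand-in for quantile-style class intervals; it is an approximation for illustration, not the package's own code, and the vector x is made up.

```r
# Hypothetical skewed variable dominated by a few repeated values
x <- c(rep(0, 60), rep(1, 25),
       2, 3, 5, 8, 13, 21, 34, 55, 89, 144, 200, 300, 400, 500, 600)
ngroups <- 5

# Quantile break points: ngroups + 1 values, some of which coincide
brks <- quantile(x, probs = seq(0, 1, length.out = ngroups + 1),
                 names = FALSE)

# cut() rejects duplicated breaks, so only the unique ones are kept,
# which leaves fewer groups than requested
brks_unique <- unique(brks)
grouped <- cut(x, breaks = brks_unique, include.lowest = TRUE)

table(grouped)
```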

tab.stats

statistics to include in the table of results. Must be a selection from: "VW", "FT", "JSD", "SPECKS", "WMabsDD", "U", "G", "pMSE", "PO50", "MabsDD", "dBhatt", "S_VW", "S_FT", "S_JSD", "S_WMabsDD", "S_G", "S_pMSE", "df", "dfG". If tab.stats = "all", all of these will be included. See utility.tab for explanations of measures.

plot.stat

statistics to plot. Choice is "VW", "FT", "JSD", "SPECKS", "WMabsDD", "U", "G", "pMSE", "PO50", "MabsDD", "dBhatt", "S_VW", "S_FT", "S_JSD", "S_WMabsDD", "S_G", "S_pMSE". See utility.tab for explanations of measures.

plot

determines if plot will be produced when the result is printed.

max.table

maximum number of cells allowed in a table by the function utility.tab.

print.tabs

logical value that determines if table of results is to be printed.

digits.tabs

number of digits to print for table, except for p-values that are always printed to 4 places.

max.scale

a numeric value for the maximum value used in calculating the shading of the plots. If it is NULL then the maximum value will be replaced by the maximum value in the data.

min.scale

a numeric value for the minimum value used in calculating the shading of the plots. If it is NULL then the minimum value will be replaced by zero.

plot.title

title for the plot.

nworst

the number of variable combinations with the worst utility scores to be printed.

ntabstoprint

the number of tables with the worst utility to print for observed and synthetic data.

k.syn

a logical indicator as to whether the sample size itself has been synthesised.

low

colour for low end of the gradient.

high

colour for high end of the gradient.

n.breaks

the number of break points to create if breaks are not given directly.

breaks

breaks for a two colour binned gradient.

compare.synorig

a logical value to determine if the function synorig.compare() should be used to check that data sets can be compared. Applies only when the synthetic data are supplied as a data frame or a list; the default is TRUE.

print.flag

Allows printing of message as metrics are calculated for each element of the table. Default is TRUE.

...

additional parameters

x

an object of class utility.tables.

Details

Calculates tables of observed and synthesised values for the variables specified in vars with the function utility.tab and produces tables and plots of one-way, two-way or three-way utility measures formed from vars. Several options for utility measures can be selected for printing or plotting. Details are in help file for utility.tab.

The tables and variables with the worst utility scores are identified. Visualisations of the matrices of utility scores are plotted. For threeway tables a third variable can be defined to select all tables involving that variable for plotting. If it is not specified the variable with tables giving the worst utility is selected as the third variable.

References

Read, T.R.C. and Cressie, N.A.C. (1988) Goodness-of-Fit Statistics for Discrete Multivariate Data, Springer-Verlag, New York.

Voas, D. and Williamson, P. (2001) Evaluating goodness-of-fit measures for synthetic microdata. Geographical and Environmental Modelling, 5(2), 177-200.

See Also

utility.tab

Examples

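library(synthpop)  # provides SD2011, syn() and utility.tables()

### original data: first 1000 rows and six variables of SD2011
ods <- SD2011[1:1000, c("sex", "age", "edu", "marital", "region", "income")]
s1 <- syn(ods)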

### synthetic data provided as a 'synds' object  
(t1 <- utility.tables(s1, ods, tab.stats = "all", print.tabs = TRUE))
### synthetic data provided as a 'data.frame' object
(t1 <- utility.tables(s1$syn, ods, tab.stats = "all", print.tabs = TRUE))

t2 <- utility.tables(s1, ods, tables = "twoway")
print(t2, max.scale = 3)

(t3 <- utility.tables(s1, ods, tab.stats = "all", tables = "threeway", 
                      third.var = "sex", print.tabs = TRUE))

(t4 <- utility.tables(s1, ods, tab.stats = "all", tables = "threeway", 
                      third.var = "sex", useNA = FALSE, print.tabs = TRUE))

(t5 <- utility.tables(s1, ods,  tab.stats = "all", 
                      print.tabs = TRUE))