disclosure: Disclosure measures

Description

Calculates disclosure measures for synthetic data. NOTE: The function that calculates disclosure results for multiple targets has been renamed from disclosure.summary to multi.disclosure.

Usage

# S3 method for synds
disclosure(object, data, keys, target, print.flag = TRUE,
           denom_lim = 5, exclude_ov_denom_lim = FALSE, not.targetlev = NULL,
           usetargetNA = TRUE, usekeysNA = TRUE,
           exclude.keys = NULL, exclude.keylevs = NULL, exclude.targetlevs = NULL,
           ngroups_target = NULL, ngroups_keys = NULL,
           thresh_1way = c(50, 90), thresh_2way = c(4, 80),
           digits = 2, to.print = c("short"), ...)
# S3 method for data.frame
disclosure(object, data, cont.na = NULL, keys, target, print.flag = TRUE,
           denom_lim = 5, exclude_ov_denom_lim = FALSE,
           not.targetlev = NULL,
           usetargetNA = TRUE, usekeysNA = TRUE,
           exclude.keys = NULL, exclude.keylevs = NULL, exclude.targetlevs = NULL,
           ngroups_target = NULL, ngroups_keys = NULL,
           thresh_1way = c(50, 90), thresh_2way = c(4, 80),
           digits = 2, to.print = c("short"), compare.synorig = TRUE, ...)
# S3 method for list
disclosure(object, data, cont.na = NULL, keys, target, print.flag = TRUE,
           denom_lim = 5, exclude_ov_denom_lim = FALSE,
           not.targetlev = NULL,
           usetargetNA = TRUE, usekeysNA = TRUE,
           exclude.keys = NULL, exclude.keylevs = NULL, exclude.targetlevs = NULL,
           ngroups_target = NULL, ngroups_keys = NULL,
           thresh_1way = c(50, 90), thresh_2way = c(4, 80),
           digits = 2, to.print = c("short"), compare.synorig = TRUE, ...)
           
# S3 method for disclosure
print(x, to.print = NULL, digits = NULL, ...)

Value

An object of class disclosure which is a list with the following components.

call: the call that created the object.
ident: Table of measures of identity disclosure, one for each synthesis. Measures are "UiO", "UiS", "UiSiO" and "repU". See the vignette disclosure.pdf for an explanation of these and the following measures.
attrib: Table of measures of attribute disclosure, one for each synthesis. These include "DiO", "DiS", "iSO", "DiSCO" and "DiSDiO". The measures "DiO" and "DiS" are the percentages of the target that are disclosed from the original and synthetic data, respectively, with these keys. The next measure, "iSO", gives the percentage of the key combinations in the synthetic data that are present in the original. "DiSCO" gives the percentage of original records where the attribution to the target is correct as judged from the original. "DiSDiO" gives the percentage of original records in "DiSCO" that are unique in the original data. The table also gives the maximum and mean of the denominators for the "DiSCO" measure, i.e. for every record that leads to a correct disclosure, the distribution of the number of observations with the same keys and the same correct target in the synthetic data. Large denominators are often an indication that the disclosure is something that might be expected from prior knowledge of relations.
allCAPs: Table of the following measures of correct attribution probability: "baseCAPd", "CAPd", "CAPs", "DCAP" and "TCAP".
check_1way: A data frame with one record per synthesis identifying the level of the target with numbers of disclosive records that are above the thresholds defined by thresh_1way, with default value c(50, 90). This means that there must be more than 50 disclosive records with this level of the target, and that 90% or more of all disclosive records must have this level of the target. The value of most_dis_lev will be blank if no level exceeds these thresholds. Note that, if there are any excluded records, this level is identified from the data without excluded or missing values of keys.
check1: The level of the target identified by check_1way, or blank if none.
check_2way: A list with one element per synthesis giving details for each of the two-way combinations of target and keys where the numbers of disclosive records are above the thresholds defined by thresh_2way. The default value for this is c(4, 80), meaning that there must be more than 4 (that is, at least 5) records with this combination of target and key and that 80% or more of records in the original data with this level of the key must have this level of the target. If no combinations exceed thresh_2way for one of the syntheses then the list element is NULL. Such disclosive combinations are often associated with a high prior probability of the target from just this level of one of the keys in the original data. Note that these combinations are identified from the data without any excluded combinations, and without missing values of keys or target if usekeysNA or usetargetNA is FALSE.
Nexclusions: A list with one element per synthesis giving the number of records excluded from attribute measures for different reasons.
keys: as input
digits: as input
Norig: Number of records in data
to.print: as input
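
As a hedged illustration of how the components listed above might be inspected, the sketch below builds a single synthesis and then reads a few elements of the returned list. It assumes the synthpop package and its SD2011 data are available, uses the default synthesis method rather than "ctree" to avoid an extra dependency, and turns progress printing off; the choice of target and keys simply follows the Examples section.

```r
library(synthpop)

# Prepare data as in the Examples section (income grouped into 7 categories)
ods  <- SD2011[, c("sex", "age", "edu", "marital", "income")]
odsF <- numtocat.syn(ods, numtocat = "income", catgroups = 7,
                     cont.na = list(income = -8))

# One synthesis with the default method (an assumption made for this sketch)
s1 <- syn(odsF$data, m = 1, seed = 75, print.flag = FALSE)

disc <- disclosure(s1, odsF$data, target = "income",
                   keys = c("sex", "age", "edu", "marital"),
                   print.flag = FALSE)

names(disc)    # component names of the returned list
disc$ident     # identity disclosure measures
disc$attrib    # attribute disclosure measures

# The print method can select which parts to show via to.print
print(disc, to.print = c("ident", "attrib"))
```

The exact numbers depend on the synthesis, so none are shown here.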

Arguments

object: an object of class synds, which stands for 'synthesised data set'. It is typically created by function syn() and it includes object$m synthesised data set(s) as object$syn. This is a single data set when object$m = 1 or a list of length object$m when object$m > 1. Alternatively, when data are synthesised not using syn(), it can be a data frame with a synthetic data set or a list of data frames with synthetic data sets, all created from the same original data with the same variables.
data: the original (observed) data set.
cont.na: For data NOT supplied as a synthetic data object created by synthpop, this gives special values for continuous variables as described in the documentation for the function syn.
keys: vector of variable names or column numbers in data that are also present in the synthetic data to act as quasi-identifiers for identity or attribute disclosure.
target: name of target variable for attribute disclosure.
denom_lim: Limit used to exclude large key-target groups; see next item.
exclude_ov_denom_lim: logical to exclude key-target combinations that contribute more than denom_lim disclosive records. These are often flagged from thresh_2way, where the first element corresponds to denom_lim.
print.flag: logical value as to whether a line is printed as disclosure is calculated for each synthetic data set.
digits: number of digits to print for disclosure measures.
usetargetNA: determines whether NA values in target are to be used in checking for disclosure.
usekeysNA: determines whether NA values in keys are to be used in checking for disclosure.
not.targetlev: Character variable giving level of target to be excluded from disclosure measures. Usually identified by check_1way.
exclude.keys: vector of names of keys that, with the next two items, will define the target and key combinations to be excluded from the calculation of disclosure measures. Often identified by check_2way.
exclude.keylevs: vector of the same length as exclude.keys that gives the levels to be excluded for the corresponding key.
exclude.targetlevs: vector of the same length as exclude.keys that gives the levels of target that will be excluded for each key and key level.
ngroups_target: Unless set to NULL (the default) a numeric target variable will be grouped into ngroups_target categories.
ngroups_keys: Unless set to NULL (the default) any numeric key will be grouped into categories. If ngroups_keys is of length 1, all numeric keys will have the same number of groups. Otherwise ngroups_keys needs to be the same length as keys and will give the number of groups for each key. If an element of ngroups_keys is zero, no grouping will be done for that key.
thresh_1way: A vector of two numeric values, both of which need to be exceeded for warnings about a level of the target that may be dominating the results. The first is the count of all disclosive records for this level of the target, and the second is the % of all original records for this level of the target. Default is c(50, 90), meaning a group of more than 50 disclosive records for this level of the target where they make up over 90% of all disclosive records.
thresh_2way: A vector of two numeric values, both of which need to be exceeded for warnings about a level of the target that may be dominating the results. The first is the count of disclosive records for a quasi-identifier used to identify possible pairs that are searched for the most disclosive key-target combination. The second is the percentage of all original records for each combination examined that must be exceeded to trigger a warning. Default is c(4, 80), meaning pairs found from key-target groups of more than 4 records where over 80% of all the original records with this level of the key have this level of the target.
to.print: Vector to determine what aspects of an object of class disclosure will be printed. Must consist of one or more of the following: "short", "ident", "attrib", "allCAPs", "all", "check_1way", "check_2way", "exclusions". Default is "short", giving a brief summary.
compare.synorig: a logical value to determine whether the function synorig.compare() should be used to check that the data sets can be compared. Used only when the synthetic data are supplied as a data.frame or a list, in which case the default is TRUE.
...: additional parameters
x: an object of class disclosure.
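
To make the two conditions behind thresh_1way more concrete, here is a minimal base R sketch. The helper flag_dominant_level() is hypothetical and is not part of synthpop. It assumes the input is a vector giving the target level of each disclosive record, that the count must exceed the first threshold, and that the percentage of disclosive records must reach the second; the package may treat boundary cases differently.

```r
# Hypothetical helper (not a synthpop function): returns the dominating
# target level, or "" if no level passes both thresholds.
flag_dominant_level <- function(disclosive_levels, thresh_1way = c(50, 90)) {
  counts <- table(disclosive_levels)
  if (length(counts) == 0) return("")
  top <- counts[which.max(counts)]            # most frequent level
  pct <- 100 * as.numeric(top) / sum(counts)  # its share of disclosive records
  # Assumption: count strictly above threshold 1, share at or above threshold 2
  if (as.numeric(top) > thresh_1way[1] && pct >= thresh_1way[2]) {
    names(top)
  } else {
    ""
  }
}

# Toy vectors of target levels for disclosive records
dominated <- c(rep("none", 95), rep("low", 3), rep("high", 2))
balanced  <- c(rep("none", 40), rep("low", 30), rep("high", 30))

flag_dominant_level(dominated)  # "none": 95 records, 95% of the total
flag_dominant_level(balanced)   # "": largest group has only 40 records
```

A level flagged this way is the kind of value that might then be passed to not.targetlev when re-running disclosure().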

Details

Calculates identity disclosure measures for a set of keys (quasi-identifiers) and attribute disclosure measures for one variable, considered as a target, from the same set of keys. The function multi.disclosure calls this function and summarises the attribute disclosure measures for multiple targets. See the vignette.
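
The distinction between identity and attribute disclosure can be sketched on a toy data set in base R. This is only an illustration of the general idea and not the package's implementation: it assumes that a record counts as unique when its key combination occurs once, and as attribute-disclosive when every record sharing its keys has the same target value. The data and column names are made up; the exact definitions of the measures are given in the vignette.

```r
# Toy "original" data; values are invented for illustration
orig <- data.frame(
  sex    = c("F", "F", "M", "M", "M", "F"),
  edu    = c("low", "low", "high", "high", "low", "high"),
  income = c("a", "a", "b", "c", "b", "c"),
  stringsAsFactors = FALSE
)
keys   <- c("sex", "edu")
target <- "income"

# One group label per distinct combination of the keys
key_id <- interaction(orig[keys], drop = TRUE)

# Identity: share of records whose key combination occurs exactly once
group_size <- ave(rep(1, nrow(orig)), key_id, FUN = sum)
pct_unique <- 100 * mean(group_size == 1)

# Attribute: share of records whose key group has a single target value
n_target_levels <- ave(as.integer(factor(orig[[target]])), key_id,
                       FUN = function(v) length(unique(v)))
pct_disclosive <- 100 * mean(n_target_levels == 1)

pct_unique      # 2 of 6 records are unique on the keys
pct_disclosive  # 4 of 6 records sit in groups with one target value
```

Here the two unique records are also attribute-disclosive, and the two records with keys F/low add to the attribute count because they share the same income value.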

References

See references in package vignette

Examples

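library(synthpop)
# Keep five variables from the SD2011 data set
ods <- SD2011[, c("sex", "age", "edu", "marital", "income")]
# Group income into 7 categories, treating -8 as a special value
odsF <- numtocat.syn(ods, numtocat = "income", catgroups = 7,
                     cont.na = list(income = -8))
# Create three synthetic data sets of 1000 records each
s1 <- syn(odsF$data, method = "ctree", seed = 75, m = 3, k = 1000)
# Disclosure measures for income given the four keys
disc1 <- disclosure(s1, odsF$data, target = "income",
                    keys = c("sex", "age", "edu", "marital"))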
