Calculates disclosure measures for synthetic data. NOTE: The companion function that calculates disclosure results for multiple targets has been renamed from disclosure.summary to multi.disclosure.
# S3 method for synds
disclosure(object, data, keys , target , print.flag = TRUE,
denom_lim = 5, exclude_ov_denom_lim = FALSE, not.targetlev = NULL,
usetargetNA = TRUE, usekeysNA = TRUE,
exclude.keys =NULL, exclude.keylevs = NULL, exclude.targetlevs = NULL,
ngroups_target = NULL, ngroups_keys = NULL,
thresh_1way = c(50, 90),thresh_2way = c(4, 80),
digits = 2, to.print = c("short"), ...)
# S3 method for data.frame
disclosure(object, data,cont.na = NULL, keys , target , print.flag = TRUE,
denom_lim = 5, exclude_ov_denom_lim = FALSE,
not.targetlev = NULL,
usetargetNA = TRUE, usekeysNA = TRUE,
exclude.keys =NULL, exclude.keylevs = NULL, exclude.targetlevs = NULL,
ngroups_target = NULL, ngroups_keys = NULL,
thresh_1way = c(50, 90),thresh_2way = c(4, 80),
digits = 2, to.print =c("short"), compare.synorig = TRUE, ...)
# S3 method for list
disclosure(object, data,cont.na = NULL, keys , target , print.flag = TRUE,
denom_lim = 5, exclude_ov_denom_lim = FALSE,
not.targetlev = NULL,
usetargetNA = TRUE, usekeysNA = TRUE,
exclude.keys =NULL, exclude.keylevs = NULL, exclude.targetlevs = NULL,
ngroups_target = NULL, ngroups_keys = NULL,
thresh_1way = c(50, 90),thresh_2way = c(4, 80),
digits = 2, to.print =c("short"), compare.synorig = TRUE, ...)
# S3 method for disclosure
print(x, to.print =NULL, digits = NULL, ...)
An object of class disclosure, which is a list with the following components.
the call that created the object.
Table of measures of identity disclosure, one for each synthesis. Measures are "UiO", "UiS", "UiSiO" and "repU". See vignette disclosure.pdf for an explanation of these and the following measures.
Table of measures of attribute disclosure, one for each synthesis. These include "DiO", "DiS", "iSO", "DiSCO" and "DiSDiO". The measures "DiO" and "DiS" are the percentages of the target that are disclosed from the original and synthetic data with these keys. The next measure, "iSO", gives the percentage of the key combinations in the synthetic data that are present in the original. "DiSCO" gives the percentage of original records where the attribution to the target is correct as judged from the original. "DiSDiO" gives the percentage of original records in "DiSCO" that are unique in the original data. The table also gives the maximum and mean of the denominators for the "DiSCO" measure, that is, for every record that leads to a correct disclosure, the number of observations in the synthetic data with the same keys and the same correct target. Large denominators are often an indication that the disclosure is something that might be expected from prior knowledge of relations.
Table of the following measures of correct attribution probability: "baseCAPd", "CAPd", "CAPs", "DCAP" and "TCAP".
A data frame with one record per synthesis identifying the level of the target with numbers of disclosive records that are above the thresholds defined by thresh_1way, with default value c(50, 90). This means that there must be more than 50 disclosive records with this level of the target, and that 90% or more of all disclosive records must have this target. The value of most_dis_lev will be blank if no level exceeds these thresholds. Note that this level will be identified for data without excluded or missing values of keys if there are any excluded records.
The level of the target identified by check_1way, or blank if none.
A list, with one element per synthesis, giving details for each of the two-way combinations of target and keys where the numbers of disclosive records are above the thresholds defined by thresh_2way. The default shown in the usage is c(4, 80), meaning that there must be more than 4 records (that is, at least 5) with this combination of target and key, and that 80% or more of records in the original data with this level of the key will have this level of the target. If no combinations exceed thresh_2way for one of the syntheses, then the list element is NULL. Such disclosive combinations are often associated with a high prior probability of the target from just this level of one of the keys in the original data. Note that these combinations are identified from the data after removing any excluded combinations, and after removing missing values of keys or target if usekeysNA or usetargetNA is FALSE.
A list, with one element per synthesis, giving the number of records excluded from attribute measures for different reasons.
as input
as input
Number of records in data
as input
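The two threshold rules described above can be illustrated with a small base R sketch. It is not the package's implementation: the helper names check_1way_sketch and check_2way_sketch and the toy vectors are hypothetical, the one-way helper assumes the target values of the disclosive records are already available as a vector, and the two-way helper simplifies by counting records directly in a single key and target. The comparison operators follow the wording above (count exceeded, one-way percentage of 90 or more).

```r
# Hypothetical helpers, not part of synthpop.

# One-way rule: return the dominant target level among disclosive records,
# or "" if no level passes both thresholds.
check_1way_sketch <- function(disclosive_target, thresh_1way = c(50, 90)) {
  counts <- table(disclosive_target)
  top <- which.max(counts)
  n_top <- as.integer(counts[top])
  pct_top <- 100 * n_top / sum(counts)
  if (n_top > thresh_1way[1] && pct_top >= thresh_1way[2]) {
    names(counts)[top]
  } else {
    ""
  }
}

# Two-way rule (simplified to one key): flag key-level/target-level pairs
# whose count and within-key percentage both exceed the thresholds.
check_2way_sketch <- function(key, target, thresh_2way = c(4, 80)) {
  tab <- table(key = key, target = target)
  pct <- 100 * prop.table(tab, margin = 1)
  hits <- which(tab > thresh_2way[1] & pct > thresh_2way[2], arr.ind = TRUE)
  if (nrow(hits) == 0) return(NULL)
  data.frame(key = rownames(tab)[hits[, 1]],
             target = colnames(tab)[hits[, 2]],
             n = tab[hits],
             pct = pct[hits])
}

# Toy data: 95 of 100 disclosive records share the level "low".
dominant <- check_1way_sketch(c(rep("low", 95), rep("high", 5)))
balanced <- check_1way_sketch(rep(c("low", "high"), 50))

# Toy data: key level "a" has target "x" in 9 of 10 records.
key <- c(rep("a", 10), rep("b", 10))
target <- c(rep("x", 9), "y", rep(c("x", "y"), 5))
pairs <- check_2way_sketch(key, target)
```

In this toy case the one-way check picks out "low" for the first vector and returns a blank for the balanced one, mirroring how most_dis_lev is described as blank when no level exceeds the thresholds.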
an object of class synds, which stands for 'synthesised data set'. It is typically created by function syn() and it includes object$m synthesised data set(s) as object$syn. This is a single data set when object$m = 1 or a list of length object$m when object$m > 1. Alternatively, when data are synthesised not using syn(), it can be a data frame with a synthetic data set or a list of data frames with synthetic data sets, all created from the same original data with the same variables.
the original (observed) data set.
For data NOT supplied as a synthetic data object created by synthpop, this gives special values for continuous variables as described in the documentation for the function syn.
vector of variable names or column numbers in data that are also present in the synthetic data to act as quasi-identifiers for identity or attribute disclosure.
name of target variable for attribute disclosure.
Limit used to exclude large key-target groups; see next item.
logical to exclude key-target combinations that contribute more than denom_lim disclosive records. These are often flagged from thresh_2way, where the first element corresponds to denom_lim.
logical value as to whether a line is printed as disclosure is calculated for each synthetic data set.
number of digits to print for disclosure measures.
determines whether NA values in target are to be used in checking for disclosure.
determines whether NA values in keys are to be used in checking for disclosure.
Character variable giving level of target to be excluded from disclosure measures. Usually identified by checklev_1way.
vector of names of keys that, with the next two items, will define the target and key combinations to be excluded from the calculation of disclosure measures. Often identified by checklev_2way.
vector of the same length as exclude.keys that gives the levels to be excluded for the corresponding key.
vector of the same length as exclude.keys that gives the levels of target that will be excluded for each key and key level.
Unless set to NULL (the default), a numeric target variable will be grouped into ngroups_target categories.
Unless set to NULL (the default), any numeric key will be grouped into categories. If ngroups_keys is of length 1, all numeric keys will have the same number of groups. Otherwise ngroups_keys needs to be the same length as keys and will give the number of groups for each key. If an element of ngroups_keys is zero, no grouping will be done for that key.
A vector of two numeric values, both of which need to be exceeded for warnings about a level of the target that may be dominating the results. The first is the count of all disclosive records for this level of the target, and the second is the percentage of all original records for this level of the target. Default is c(50, 90), meaning a group of more than 50 disclosive records for this level of the target where they make up over 90% of all disclosive records.
A vector of two numeric values, both of which need to be exceeded for warnings about a key-target combination that may be dominating the results. The first is the count of disclosive records for a quasi-identifier, used to identify possible pairs that are searched for the most disclosive key-target combination. The second is the percentage of all original records for each combination examined that must be exceeded to trigger a warning. The default shown in the usage is c(4, 80), meaning pairs found from key-target groups of more than 4 records where over 80% of all the original values with this level of the key have this level of the target.
Vector to determine what aspects of an object of class disclosure will be printed. Must consist of one or more of the following: "short", "ident", "attrib", "allCAPs", "all", "check_1way", "check_2way", "exclusions". Default is "short", giving a brief summary.
a logical value to determine whether the function synorig.compare() should be used to check that the data sets can be compared. Used when the synthetic data are supplied as a data.frame or a list; the default is TRUE.
additional parameters
an object of class disclosure.
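The recycling rule for ngroups_keys described above (a single value applies to every numeric key, a vector gives one value per key, and zero means no grouping) can be sketched in base R. This is only an illustration under stated assumptions: the helpers group_numeric and group_keys are hypothetical, and quantile-based cut points are an assumption made for the sketch, since this page does not say how synthpop chooses the group boundaries.

```r
# Hypothetical helpers, not part of synthpop.

# Group one variable into ngroups categories using quantile cut points
# (an assumed grouping method). Non-numeric variables and ngroups == 0
# are returned unchanged.
group_numeric <- function(x, ngroups) {
  if (!is.numeric(x) || ngroups == 0) return(x)
  probs <- seq(0, 1, length.out = ngroups + 1)
  breaks <- unique(quantile(x, probs = probs, na.rm = TRUE))
  cut(x, breaks = breaks, include.lowest = TRUE)
}

# Apply the recycling rule: length 1 is used for every key,
# otherwise one value per key is required.
group_keys <- function(data, keys, ngroups_keys) {
  if (length(ngroups_keys) == 1) {
    ngroups_keys <- rep(ngroups_keys, length(keys))
  }
  stopifnot(length(ngroups_keys) == length(keys))
  for (i in seq_along(keys)) {
    data[[keys[i]]] <- group_numeric(data[[keys[i]]], ngroups_keys[i])
  }
  data
}

# Toy data: group age into 4 categories, leave hours and sex as they are.
toy <- data.frame(age = 1:100,
                  hours = rep(c(20, 40), 50),
                  sex = factor(rep(c("F", "M"), 50)))
grouped <- group_keys(toy, keys = c("age", "hours", "sex"),
                      ngroups_keys = c(4, 0, 0))
```

Here only age becomes a factor with four levels; hours stays numeric because its entry in ngroups_keys is zero, and sex is left alone because it is not numeric.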
Calculates identity disclosure measures for a set of keys (quasi-identifiers) and attribute disclosure measures for one variable, considered as a target, from the same set of keys. The function multi.disclosure calls this function and summarises the attribute disclosure measures for multiple targets.
See the vignette
See references in package vignette
syn
multi.disclosure
library(synthpop)
ods <- SD2011[, c("sex", "age", "edu", "marital", "income")]
odsF <- numtocat.syn(ods, numtocat = "income", catgroups = 7, cont.na = list(income = -8))
s1 <- syn(odsF$data, method = "ctree", seed = 75, m = 3, k = 1000)
disc1 <- disclosure(s1, odsF$data, target = "income",
                    keys = c("sex", "age", "edu", "marital"))
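As a possible follow-up, the sketch below repeats the setup above so that it runs on its own, then uses the print method with the to.print argument to show more than the default "short" summary. It relies only on arguments listed in the usage; print.flag = FALSE is set merely to suppress the per-synthesis progress lines, and names() is used to inspect the components rather than assuming their names.

```r
library(synthpop)

# Same setup as the example above, repeated so this block is self-contained.
ods <- SD2011[, c("sex", "age", "edu", "marital", "income")]
odsF <- numtocat.syn(ods, numtocat = "income", catgroups = 7,
                     cont.na = list(income = -8))
s1 <- syn(odsF$data, method = "ctree", seed = 75, m = 3, k = 1000)
disc1 <- disclosure(s1, odsF$data, target = "income",
                    keys = c("sex", "age", "edu", "marital"),
                    print.flag = FALSE)

# Print the identity and attribute disclosure tables with 1 decimal place.
print(disc1, to.print = c("ident", "attrib"), digits = 1)

# Print the two-way combinations flagged by thresh_2way, if any.
print(disc1, to.print = "check_2way")

# List the components stored in the returned object.
names(disc1)
```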