Learn R Programming

synthpop (version 1.9-0)

syn.ipf: Synthesis of a group of categorical variables by iterative proportional fitting

Description

A fit to the table is obtained from the log-linear fit that matches the numbers in the margins specified by the margin parameters.

Usage

syn.ipf(x, k, proper = FALSE, priorn = 1, structzero = NULL, 
        gmargins = "twoway", othmargins = NULL, tol = 1e-3,
        max.its = 5000, maxtable = 1e8, print.its = FALSE,
        epsilon = 0, delta = 0.05, 
        noisetype = "Laplace", ...)

Value

A list with two components:

res

a data frame with k rows containing the synthesised data.

fit

a list made up of two lists: the margins fitted and the original data for each margin.

Arguments

x

a data frame of the set of original data to be synthesised.

k

a number of rows in each synthetic data set - defaults to n.

proper

if proper = TRUE x is replaced with a bootstrap sample before synthesis, thus effectively sampling from the posterior distribution of the model, given the data.

priorn

the sum of the parameters of the Dirichlet prior which can be thought of as a pseudo-count giving the number of observations that inform prior knowledge about the parameters.

structzero

a named list of lists that defines which cells in the table are structural zeros and will remain as zeros in the synthetic data, by leaving their prior as zeros. Each element of the structzero list is a list that describes a set of cells in the table defined by a combination of two or more variables and a name of each such element must consist of those variable names seperated by an underscore, e.g. sex_edu. The length of each such element is determined by the number of variables and each component gives the variable levels (numeric or labels) that define the structural zero cells (see an example below).

gmargins

a single character to define a group of margins. At present there is "oneway" and "twoway" option that creates, respectively, all 1-way and 2-way margins from the table.

othmargins

a list of margins that will be fitted. If gmargins is not NULL othmargins will be added to them.

tol

stopping criterion for Ipfp.

max.its

maximum umber of iterations allowed for Ipfp.

maxtable

the number of cells in the cross-tabulation of all the variables that will trigger a severe warning.

print.its

if true the iterations from Ipfp will be printed on the console. Otherwise only a message as to whether the iterations have converged will be given at the end of the fitting.

epsilon

epsilon value for overall differential privacy (DP) parameter. This is implemented by dividing the privacy budget equally over all the margins used to fit the data.

delta

Parameter for epsilon-delta differential privacy (DP) parameter, when noisetype = "Gaussian".

noisetype

One of "Laplace" or "Gaussian" to determine the type of noise to be added that will make the synthesis DP (Laplace) or approximately DP (Gaussian). For noisetype "Gaussian" your synthesis will fail if epsilon >1.

...

additional parameters.

Details

When used in syn function the group of variables with method = "ipf" must all be together at the start of the visit sequence. This function is designed for categorical variables, but it can also be used for numerical variables if they are categorised by specifying them in the numtocat parameter of the main function syn. Subsequent variables in visit.sequence are then synthesised conditional on the synthesised values of the grouped variables. A fit to the table is obtained from the log-linear fit that matches the numbers in the margins specified by the margin parameters. Prior probabilities for the proportions in each cell of the table are given by a Dirichlet distribution with the same parameter for every cell in the table that is not a structural zero. The sum of these parameters is priorn. The default priorn = 1 can be thought of as equivalent to the knowledge that 1 observation would be equally likely to fall in any cell of the table. The synthetic data are generated from a multinomial distribution with parameters given by the expected posterior probabilities for each cell of the table. If the maximum likelihood estimate from the log-linear fit to cell \(c_i\) is \(p_i\) and the table has \(N\) cells that are not structural zeros then the expectation of the posterior probability for this cell is \((p_i + priorn/N^2) / (1 + priorn / N^2)\) or equivalently \((N * p_i + priorn/N) / (N + priorn / N)\).

Unlike syn.satcat, which fits saturated models from their conditional distributions, x can include any combination of variables, including those not present in the original data, except those defined by structzero.

NOTE that when the function is called by setting elements of method in syn to "ipf", the parameters priorn, structzero, gmargins, othmargins, tol, max.its, maxtable, print.its and epsilon, must be supplied to syn as e.g. ipf.priorn.

Examples

Run this code
ods <- SD2011[, c(1, 4, 5, 6, 2, 10, 11)]
table(ods[, c("placesize", "region")])

# Each `placesize_region` sublist: 
# for each relevant level of `placesize` defined in the first element, 
# the second element defines regions (variable `region`) that do not 
# have places of that size.

struct.zero <- list(
  placesize_region = list(placesize = "URBAN 500,000 AND OVER", 
                          region = c(2, 4, 5, 8:13, 16)),
  placesize_region = list(placesize = "URBAN 200,000-500,000", 
                          region = c(3, 4, 10:11, 13)),
  placesize_region = list(placesize = "URBAN 20,000-100,000", 
                          region = c(1, 3, 5, 6, 8, 9, 14:15)))
                          
# you could use  the object struct.zero in the command below 
# byt devtools checking did not like it so have added the list instead

synipf <- syn(ods, method = c(rep("ipf", 4), "ctree", "normrank", "ctree"), 
              ipf.gmargins = "twoway", ipf.othmargins = list(c(1, 2, 3)),
              ipf.priorn = 2, ipf.structzero = list(
  placesize_region = list(placesize = "URBAN 500,000 AND OVER", 
                          region = c(2, 4, 5, 8:13, 16)),
  placesize_region = list(placesize = "URBAN 200,000-500,000", 
                          region = c(3, 4, 10:11, 13)),
  placesize_region = list(placesize = "URBAN 20,000-100,000", 
                          region = c(1, 3, 5, 6, 8, 9, 14:15))))

Run the code above in your browser using DataLab