Generates univariate synthetic data using classification and regression trees (with or without bootstrap).
syn.ctree(y, x, xp, smoothing = "", proper = FALSE,
minbucket = 5, mincriterion = 0.9, ...)
syn.cart(y, x, xp, smoothing = "", proper = FALSE,
minbucket = 5, cp = 1e-08, ...)
A list with two components:
a vector of length k with synthetic values of y.
the fitted model which is an object of class rpart.object
or ctree.object that can be printed or plotted.
y: an original data vector of length n.
x: a matrix (n x p) of original covariates.
xp: a matrix (k x p) of synthesised covariates.
smoothing: smoothing method for a numeric variable. See
syn.smooth.
proper: for proper synthesis (proper = TRUE) a CART
model is fitted to a bootstrapped sample of the original data.
minbucket: the minimum number of observations in
any terminal node. See rpart.control and
ctree_control for details.
cp: complexity parameter. Any split that does not
decrease the overall lack of fit by a factor of cp is not
attempted. Small values of cp will grow large trees.
See rpart.control for details.
mincriterion: 1 - p-value of the test that must be
exceeded for a split to be retained. Small values of
mincriterion will grow large trees.
See ctree_control for details.
...: additional parameters passed to
ctree_control for syn.ctree and
rpart.control for syn.cart.
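The bootstrap step used for proper synthesis can be pictured as resampling the rows of y and x jointly before the tree is fitted. The following is a minimal base R sketch, assuming rows are drawn with replacement at the original sample size n; bootstrap_sketch is a hypothetical helper written for illustration and is not a function of the package.

```r
# Hypothetical helper (not part of the package): draw a bootstrap sample
# of the original data, keeping each y value paired with its covariate row.
bootstrap_sketch <- function(y, x) {
  idx <- sample.int(length(y), replace = TRUE)  # n row indices, with replacement
  list(idx = idx, y = y[idx], x = x[idx, , drop = FALSE])
}

set.seed(123)
x <- matrix(rnorm(20), nrow = 10, ncol = 2)
y <- x[, 1] + rnorm(10)
boot <- bootstrap_sketch(y, x)
# A tree fitted to boot$y and boot$x reflects sampling variability
# in the fitted model, which is the purpose of proper = TRUE.
```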
The procedure for synthesis by a CART model is as follows:
1. Fit a classification or regression tree by binary recursive partitioning.
2. For each row of xp, find the terminal node it falls into.
3. Randomly draw a donor from the members of that node and take the observed
value of y from that draw as the synthetic value.
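The steps above can be sketched for a numeric y using rpart directly. This is an illustrative sketch under stated assumptions, not the package's implementation: cart_synth_sketch and the names of its return components are hypothetical, and terminal nodes for xp are obtained by overwriting the fitted values stored in the tree with node numbers so that predict() returns node membership.

```r
library(rpart)

# Hypothetical helper (not part of the package): CART synthesis of a numeric y.
cart_synth_sketch <- function(y, x, xp, minbucket = 5, cp = 1e-08) {
  x  <- as.data.frame(x)
  xp <- as.data.frame(xp)

  # Step 1: fit a regression tree by binary recursive partitioning.
  fit <- rpart(y ~ ., data = cbind(y = y, x), method = "anova",
               control = rpart.control(minbucket = minbucket, cp = cp))

  # Terminal node number of every original observation.
  node_ids <- as.numeric(row.names(fit$frame))
  leaf_obs <- node_ids[fit$where]

  # Step 2: on a copy of the tree, replace fitted means by node numbers
  # so that predict() reports the terminal node of each row of xp.
  fit_nodes <- fit
  fit_nodes$frame$yval <- node_ids
  leaf_new <- predict(fit_nodes, newdata = xp)

  # Step 3: for each terminal node, draw donors with replacement from the
  # observed y values in that node. Indices are sampled to avoid the
  # special behaviour of sample() on a length-one vector.
  values <- numeric(nrow(xp))
  for (j in unique(leaf_new)) {
    donors <- y[leaf_obs == j]
    rows   <- which(leaf_new == j)
    values[rows] <- donors[sample.int(length(donors), length(rows), replace = TRUE)]
  }
  list(values = values, fit = fit)
}

set.seed(1)
n <- 200
x <- data.frame(a = rnorm(n), b = runif(n))
y <- 2 * x$a + rnorm(n)
out <- cart_synth_sketch(y, x, xp = x[1:50, ])
```

Because every synthetic value is an observed value of y, the unsmoothed output reproduces real records' values exactly, which is the motivation for smoothing and for a larger minbucket.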
syn.ctree uses the ctree function from the
party package and syn.cart uses the rpart
function from the rpart package. They differ, among other things,
in how the splitting variable is selected and in the stopping rule for the
splitting process.
Gaussian kernel smoothing can be applied to continuous variables
by setting the smoothing parameter to "density". It is recommended
as a tool to decrease the disclosure risk. Increasing minbucket
is another means of data protection.
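The idea behind Gaussian kernel smoothing can be illustrated in base R: drawing from a Gaussian kernel density estimate is equivalent to taking a donor value and adding normal noise with standard deviation equal to the bandwidth. The sketch below assumes the rule-of-thumb bandwidth bw.nrd0 computed from the observed values; smooth_density_sketch is a hypothetical helper, and the behaviour of syn.smooth itself may differ in its details.

```r
# Hypothetical helper (not the package's syn.smooth): perturb synthetic
# values with Gaussian noise whose sd is a kernel bandwidth.
smooth_density_sketch <- function(ysyn, yobs) {
  bw <- stats::bw.nrd0(yobs)  # assumed bandwidth choice (rule of thumb)
  rnorm(length(ysyn), mean = ysyn, sd = bw)
}

set.seed(42)
yobs <- rexp(500, rate = 1 / 1000)          # observed continuous variable
ysyn <- sample(yobs, 100, replace = TRUE)   # donor values, as drawn from tree nodes
ysm  <- smooth_density_sketch(ysyn, yobs)
# After smoothing, synthetic values no longer coincide with observed ones.
```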
CART models were suggested for generation of synthetic data by Reiter (2005) and then evaluated by Drechsler and Reiter (2011).
Reiter, J.P. (2005). Using CART to generate partially synthetic, public use microdata. Journal of Official Statistics, 21(3), 441-462.
Drechsler, J. and Reiter, J.P. (2011). An empirical evaluation of easily implemented, nonparametric methods for generating synthetic datasets. Computational Statistics and Data Analysis, 55(12), 3232-3243.
syn, syn.survctree,
rpart, ctree,
syn.smooth