The procedure accepts a `data.frame` or `data.table`
containing all information necessary for the record swapping, e.g. the
variables specified through the parameters `hid`, `similar`, `hierarchy`, etc.
First, the micro data in `data` is ordered by `hid` and the identification
risk is calculated for each record in each hierarchy level. Currently,
only counts are used as identification risk, and the inverse of the counts
is used as sampling probability.
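
The counting step can be sketched in base R as follows. This is only an illustration under stated assumptions: the column names (`nuts1`, `nuts2`, `hsize`) and the helper `count_in()` are hypothetical, and the counts are assumed to be taken over one risk variable combined with the geographic variables; it is not the package's internal code.

```r
# Toy household-level data; all column names are hypothetical.
dat <- data.frame(
  hid   = 1:6,
  nuts1 = c(1, 1, 1, 1, 2, 2),
  nuts2 = c(1, 1, 2, 2, 3, 3),
  hsize = c(1, 1, 1, 4, 2, 2)
)

# Hypothetical helper: for every row, the number of rows that share the
# same combination of values in `vars`.
count_in <- function(data, vars) {
  ave(rep(1L, nrow(data)), data[vars], FUN = sum)
}

# Counts for the risk variable `hsize` within each hierarchy level.
counts <- data.frame(
  nuts1 = count_in(dat, c("nuts1", "hsize")),
  nuts2 = count_in(dat, c("nuts1", "nuts2", "hsize"))
)

# Inverse counts, used here as (unnormalised) sampling weights.
prob <- 1 / counts
```

Households that are rare within an area receive a large weight and are therefore more likely to be drawn for swapping.
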
NOTE: It will be possible to supply an identification risk for each record
and hierarchy level, which will be passed down to the C++ function. This
is, however, not yet fully implemented.
With the parameter `k_anonymity`, a k-anonymity rule is applied to define
risky households in each hierarchy level. A household is flagged as risky
if counts < k_anonymity in any hierarchy level, and such a household needs
to be swapped across this hierarchy level.
For instance, given a geographic hierarchy of NUTS1 > NUTS2 > NUTS3, the
counts are calculated for each geographic variable in combination with the
defined `risk_variables`. If the counts for a record fall below `k_anonymity`
at the county hierarchy level (NUTS1, NUTS2, ...), then this record needs
to be swapped across counties.
Setting `k_anonymity = 0` disables this feature and no risky households
are defined.
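
As a minimal sketch of the k-anonymity rule, assume the counts per household and hierarchy level are already available as a matrix. The values below are made up and `level_to_swap` is a hypothetical name.

```r
# Hypothetical counts per household (rows) and hierarchy level (columns).
counts <- cbind(nuts1 = c(3, 3, 3, 1, 2, 2),
                nuts2 = c(2, 2, 1, 1, 2, 2))
k_anonymity <- 2

# TRUE where a household falls below the threshold at that level.
risky <- counts < k_anonymity

# Highest hierarchy level (smallest column index) at which each household
# is risky; NA means the household is not risky at any level.
level_to_swap <- apply(risky, 1, function(r) {
  if (any(r)) which(r)[1] else NA_integer_
})

# With k_anonymity = 0 no count can fall below the threshold.
any_risky_when_disabled <- any(counts < 0)
```

Here household 4 is risky already at the first level and household 3 only at the second level.
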
After that, the targeted record swapping is applied, starting from the highest
and proceeding to the lowest hierarchy level, cycling through all geographic
areas at each hierarchy level, e.g. every county, then every municipality in
every county, etc.
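
The order of traversal can be illustrated with a short base R sketch. The data and the list `visited` are hypothetical; the loop only enumerates the areas that would be visited at each level and does not perform any swapping.

```r
# Toy geography with two hierarchy levels; column names are hypothetical.
dat <- data.frame(
  nuts1 = c(1, 1, 1, 2, 2),
  nuts2 = c(11, 11, 12, 21, 22)
)
hierarchy <- c("nuts1", "nuts2")

# Walk from the highest to the lowest level and collect the distinct
# areas at each level (an area is identified by all levels above it too).
visited <- list()
for (level in seq_along(hierarchy)) {
  vars <- hierarchy[seq_len(level)]
  visited[[hierarchy[level]]] <- unique(dat[vars])
}
```
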
In each geographic area, a set of records to be swapped is created. In all
but the lowest hierarchy level, this set consists ONLY of records which
do not fulfil the k-anonymity condition and have not already
been swapped. Those records are swapped with records not belonging to
the same geographic area, which have not already been swapped beforehand.
Swapping refers to the interchange of geographic variables defined in
`hierarchy`. When a record is swapped all other records containing the
same `hid` are swapped as well.
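
A swap of two households can be sketched as below. The function `swap_households()` and the column names are hypothetical and only illustrate the idea that the geographic variables are exchanged for every record of both households, while all other variables stay untouched.

```r
# Toy person-level data; households 1 and 3 have several members.
dat <- data.frame(
  hid   = c(1, 1, 2, 3, 3, 3),
  nuts1 = c(1, 1, 2, 2, 2, 2),
  nuts2 = c(11, 11, 21, 22, 22, 22),
  age   = c(40, 38, 71, 35, 33, 4)
)
hierarchy <- c("nuts1", "nuts2")

# Hypothetical helper: exchange the geographic variables of two households.
swap_households <- function(data, hid_a, hid_b, hierarchy, hid = "hid") {
  rows_a <- which(data[[hid]] == hid_a)
  rows_b <- which(data[[hid]] == hid_b)
  # Geography is constant within a household, so the first row suffices.
  geo_a <- data[rows_a[1], hierarchy, drop = FALSE]
  geo_b <- data[rows_b[1], hierarchy, drop = FALSE]
  for (v in hierarchy) {
    data[rows_a, v] <- geo_b[[v]]
    data[rows_b, v] <- geo_a[[v]]
  }
  data
}

swapped <- swap_households(dat, 1, 3, hierarchy)
```
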
At the lowest hierarchy level, in every geographic area the set of records to
be swapped is made up of all records which do not fulfil the k-anonymity
condition, plus as many additional records as are needed for the proportion
of swapped records in the geographic area to be consistent with `swaprate`.
If, due to the k-anonymity condition, the number of records already swapped
in this geographic area exceeds that target, then only the records which do
not fulfil the k-anonymity condition are swapped.
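
One possible reading of this rule, written as a small base R function. The function `n_to_swap()` and its arguments are hypothetical, and the exact bookkeeping in the C++ implementation may differ.

```r
# Hypothetical helper: number of households to swap in one lowest-level area.
#   n_households      households in the area
#   n_risky           households violating k-anonymity, not yet swapped
#   n_already_swapped households in the area swapped at higher levels
n_to_swap <- function(n_households, n_risky, n_already_swapped, swaprate) {
  target <- round(n_households * swaprate)
  # Additional, non-risky households needed to reach the target (never < 0).
  remaining <- max(target - n_already_swapped - n_risky, 0)
  n_risky + remaining
}

n_to_swap(100, n_risky = 3, n_already_swapped = 2, swaprate = 0.1)   # 8
n_to_swap(100, n_risky = 3, n_already_swapped = 12, swaprate = 0.1)  # 3
```

In the second call the target of 10 has already been exceeded, so only the 3 risky households are swapped.
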
Using the parameter `similar`, one can define similarity profiles.
`similar` needs to be a list of vectors, with each list entry containing
column indices of `data`. These entries are used when searching for donor
households, meaning that for a specific record the set of all donor
records is made up of records which have the same values in
`similar[[1]]`. Note, however, that these variables
must be household-level variables (not person-level ones). If no suitable
donor can be found, the next similarity profile, `similar[[2]]`, is used and
the set of all donors is then made up of all records which have the
same values in the columns indexed by `similar[[2]]`. This procedure
continues until a donor record is found or all the similarity profiles
have been used.
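
The fallback through the similarity profiles can be sketched as follows. The helper `find_donors()`, the toy data and the column names are hypothetical; the sketch assumes one row per household and numeric columns only.

```r
# Household-level toy data; column names are hypothetical.
hh <- data.frame(
  hid    = 1:5,
  region = c(1, 2, 2, 2, 2),
  hsize  = c(3, 3, 3, 2, 1),
  tenure = c(1, 2, 1, 1, 2)
)
# Profile 1: same hsize and tenure; profile 2: same hsize only.
similar <- list(c(3, 4), 3)

# Hypothetical helper: return the candidate rows matching household `i`
# on the first similarity profile that yields at least one donor.
find_donors <- function(hh, i, similar, candidates) {
  m <- as.matrix(hh)
  for (profile in similar) {
    same <- vapply(candidates, function(j) {
      all(m[j, profile] == m[i, profile])
    }, logical(1))
    if (any(same)) return(candidates[same])
  }
  integer(0)  # no donor found under any profile
}

# Donors must come from a different geographic area.
candidates <- which(hh$region != hh$region[1])
donors <- find_donors(hh, 1, similar, candidates)

# If household 1 had a tenure value nobody shares, profile 2 is used.
hh2 <- hh
hh2$tenure[1] <- 9
donors_fallback <- find_donors(hh2, 1, similar, candidates)
```
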
`swaprate` sets the proportion of households to be swapped, where a single
swap counts as swapping 2 households: the sampled household and the
corresponding donor. Prior to the procedure, the swaprate is applied at
the lowest hierarchy level to determine the target number of swapped
households in each of the lowest-level areas. If the target numbers are
not integers, they are randomly rounded up or down such that the
total number of households swapped is consistent with the swaprate.
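
One way to perform such a randomised rounding is sketched below: every target is rounded down, and the missing units are then assigned to randomly chosen areas, with areas that have a larger fractional part being more likely to be rounded up. This is an illustrative scheme with made-up numbers, not necessarily the one used internally.

```r
set.seed(1)

# Hypothetical number of households in three lowest-level areas.
n_hh <- c(a = 23, b = 41, c = 36)
swaprate <- 0.1

target <- n_hh * swaprate          # non-integer targets: 2.3, 4.1, 3.6
final  <- floor(target)            # start by rounding everything down
extra  <- round(sum(target)) - sum(final)  # units still to distribute

if (extra > 0) {
  frac <- target - floor(target)
  # Pick the areas to round up, weighted by their fractional part.
  up <- sample.int(length(target), size = extra, prob = frac)
  final[up] <- final[up] + 1
}
```

Whatever the random draw, each area ends up with its target rounded either down or up, and the total equals the overall target of 10 households.
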