rankNND.hotdeck: Rank distance hot deck method.

Description

This function implements the rank distance hot deck method. For each recipient record, the closest donor is chosen by considering the distance between the percentage points of the empirical cumulative distribution function.

Usage

rankNND.hotdeck(data.rec, data.don, var.rec, var.don=var.rec, 
                 don.class=NULL,  weight.rec=NULL, weight.don=NULL,
                 constrained=FALSE, constr.alg="Hungarian",
                 keep.t=FALSE)

Value

An R list with the following components:

mtc.ids: A matrix with the same number of rows as data.rec and two columns. The first column contains the row names of data.rec and the second column contains the row names of the corresponding donors selected from data.don. When the input matrices do not contain row names, a numeric matrix with the row indexes is provided.
dist.rd: A vector with the distances between each recipient unit and the corresponding donor.
noad: The number of available donors at the minimum distance for each recipient unit (only in the unconstrained case).
call: How the function has been called.

Arguments

data.rec

A numeric matrix or data frame that plays the role of recipient. It must contain the variable var.rec used in computing the percentage points of the empirical cumulative distribution function and, where applicable, the variables that identify the donation classes (see argument don.class) and the case weights (see argument weight.rec).

Missing values (NA) are not allowed.

data.don

A matrix or data frame that plays the role of donor. It must contain the variable var.don used in computing the percentage points of the empirical cumulative distribution function and, where applicable, the variables that identify the donation classes (see argument don.class) and the case weights (see argument weight.don).

var.rec

A character vector with the name of the variable in data.rec that should be ranked.

var.don

A character vector with the name of the variable in data.don that should be ranked. If not specified, by default var.don=var.rec.

don.class

A character vector with the names of the variables (columns in both data frames) that identify donation classes. The percentage points are computed independently in each donation class, and distances are computed only between units in the same donation class. Empty donation classes should be avoided. The variables used to form donation classes should preferably be defined as factors.

When not specified (default), no donation classes are used.

weight.rec

Optional name of the variable in data.rec that provides the weights to be used in computing the empirical cumulative distribution function for var.rec (see Details).

weight.don

Optional name of the variable in data.don that provides the weights to be used in computing the empirical cumulative distribution function for var.don (see Details).

constrained

Logical. When constrained=FALSE (default), each record in data.don can be used as a donor more than once. Conversely, when constrained=TRUE each record in data.don can be used as a donor only once. In this case, the set of donors is selected by solving a transportation problem that minimizes the overall matching distance. See the description of the argument constr.alg for details.

constr.alg

A string that has to be specified when constrained=TRUE. Two choices are available: "lpSolve" and "Hungarian". With constr.alg="lpSolve", the transportation problem is solved by means of the function lp.transport available in the package lpSolve. With constr.alg="Hungarian" (default), the transportation problem is solved using the Hungarian method, implemented in the function solve_LSAP available in the package clue. Note that constr.alg="Hungarian" is faster and more efficient.

keep.t

Logical. When donation classes are used, setting keep.t=TRUE prints information on the donation classes being processed (by default keep.t=FALSE).

Author

Marcello D'Orazio mdo.statmatch@gmail.com

Details

This function finds a donor record for each record in the recipient data set. The chosen donor is the one at the closest distance in terms of the empirical cumulative distribution (Singh et al., 1993). In practice, the distance is computed from the estimated empirical cumulative distribution of the reference variable (var.rec and var.don) in data.rec and data.don. The empirical cumulative distribution function is estimated by:

$$ \hat{F}(y) = \frac{1}{n} \sum_{i=1}^{n} I(y_i\leq y) $$

where $I(y_i\leq y)=1$ if $y_i\leq y$ and 0 otherwise.

In presence of weights, the empirical cumulative distribution function is estimated by:

$$ \hat{F}(y) = \frac{\sum_{i=1}^{n} w_i I(y_i\leq y)}{\sum_{i=1}^{n} w_i} $$
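The two estimators above can be illustrated with a minimal base R sketch. The helper name wecdf and the toy values are hypothetical and are not part of StatMatch; the function simply evaluates the formula at each observed value, and reduces to the unweighted estimator when all weights are equal.

```r
# Hypothetical helper (not part of StatMatch): weighted ECDF evaluated
# at each observed value of y. With equal weights it reduces to the
# unweighted estimator.
wecdf <- function(y, w = rep(1, length(y))) {
  # For each y[i], sum the weights of the units with value <= y[i],
  # then divide by the total weight.
  sapply(y, function(v) sum(w[y <= v])) / sum(w)
}

y <- c(20, 35, 35, 50)   # toy values of the ranked variable
w <- c(1, 2, 1, 4)       # toy case weights

wecdf(y)      # 0.25 0.75 0.75 1.00
wecdf(y, w)   # 0.125 0.500 0.500 1.000
```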

In the unconstrained case, when several donors are at the same minimum distance from a recipient, one of them is chosen at random.
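As a rough illustration of the unconstrained search, the following base R sketch computes the percentage points separately for recipients and donors, builds the matrix of absolute distances, and picks the closest donor for each recipient, breaking ties at random. The toy vectors and object names are hypothetical, and the code is not taken from the package internals.

```r
# Toy data (hypothetical): values of the ranked variable.
rec <- c(a = 18, b = 40, c = 65)
don <- c(d1 = 20, d2 = 30, d3 = 45, d4 = 70)

# Percentage points of the ECDF, estimated separately in each data set.
p_rec <- ecdf(rec)(rec)   # 1/3, 2/3, 1
p_don <- ecdf(don)(don)   # 0.25, 0.50, 0.75, 1

# Absolute distances between percentage points (recipients in rows).
dd <- abs(outer(p_rec, p_don, "-"))
dimnames(dd) <- list(names(rec), names(don))

# Number of donors at the minimum distance for each recipient.
noad <- apply(dd, 1, function(d) sum(d == min(d)))

# Closest donor; if several donors tie, one is drawn at random.
set.seed(1)
pick <- apply(dd, 1, function(d) {
  ties <- which(d == min(d))
  if (length(ties) > 1) sample(ties, 1) else ties
})

names(don)[pick]   # "d1" "d3" "d4"
```

In this toy case no ties occur, so the random draw is never used and every recipient has exactly one donor at the minimum distance.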

When donation classes are introduced, the empirical cumulative distribution function is estimated independently in each donation class and, for each recipient, the search for a donor is restricted to donors in the same donation class.
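Estimating the percentage points independently within each donation class can be sketched in base R with ave(). The data frame and its column names (cls, age, pp) are hypothetical and serve only to show the grouping step.

```r
# Toy data (hypothetical): one class variable and the variable to rank.
dat <- data.frame(
  cls = factor(c("N", "N", "N", "S", "S")),
  age = c(20, 40, 60, 30, 50)
)

# ECDF percentage points computed separately within each class.
dat$pp <- ave(dat$age, dat$cls, FUN = function(y) ecdf(y)(y))

dat$pp   # 1/3, 2/3, 1 in class N; 0.5, 1 in class S
```

Distances would then be computed only between recipients and donors that share the same value of cls.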

By default, a donor can be chosen more than once. To avoid this, set constrained=TRUE. In that case each donor can be chosen just once, and the donors are selected by solving a transportation problem whose objective is to minimize the overall matching distance (the sum of the recipient-donor distances).
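The difference between the two settings can be seen on a tiny hypothetical example. The sketch below uses exhaustive enumeration of all one-to-one assignments, which is feasible only for a handful of units and is not the algorithm used by the function (the function relies on lpSolve or clue, as described under constr.alg); it is meant only to show the objective being minimized.

```r
# Toy percentage points (hypothetical) for 3 recipients and 3 donors.
p_rec <- c(0.30, 0.35, 0.90)
p_don <- c(0.32, 0.60, 0.95)
dd <- abs(outer(p_rec, p_don, "-"))   # recipients in rows

# Unconstrained: each recipient takes its own closest donor,
# so donor 1 is used twice here.
unc <- apply(dd, 1, which.min)        # 1 1 3

# Constrained: enumerate every assignment that uses each donor at
# most once and keep the one with the smallest total distance.
cand <- as.matrix(expand.grid(rep(list(seq_len(ncol(dd))), nrow(dd))))
cand <- cand[apply(cand, 1, function(a) !anyDuplicated(a)), , drop = FALSE]
total <- apply(cand, 1, function(a) sum(dd[cbind(seq_along(a), a)]))
best <- unname(cand[which.min(total), ])

best         # 1 2 3
min(total)   # 0.32 (up to floating point error)
```

The second recipient gives up its closest donor (donor 1) and takes donor 2 instead, because this yields the smallest sum of distances among all assignments without repeated donors.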

References

D'Orazio, M., Di Zio, M. and Scanu, M. (2006). Statistical Matching: Theory and Practice. Wiley, Chichester.

Singh, A.C., Mantel, H., Kinack, M. and Rowe, G. (1993). “Statistical matching: use of auxiliary information as an alternative to the conditional independence assumption”. Survey Methodology, 19, 59-79.

Examples



library(StatMatch) # provides rankNND.hotdeck() and create.fused()
data(samp.A, samp.B, package="StatMatch") # loads data sets

# samp.A plays the role of recipient
?samp.A

# samp.B plays the role of donor
?samp.B


# rankNND.hotdeck()
# donation classes formed using "area5"
# ecdf computed on "age"
# UNCONSTRAINED case
out.1 <- rankNND.hotdeck(data.rec=samp.A, data.don=samp.B, var.rec="age",
                         don.class="area5")
fused.1 <- create.fused(data.rec=samp.A, data.don=samp.B,
                        mtc.ids=out.1$mtc.ids, z.vars="labour5")
head(fused.1)

# as before, but ecdf estimated using weights
# UNCONSTRAINED case
out.2 <- rankNND.hotdeck(data.rec=samp.A, data.don=samp.B, var.rec="age",
                         don.class="area5",
                         weight.rec="ww", weight.don="ww")
fused.2 <- create.fused(data.rec=samp.A, data.don=samp.B,
                        mtc.ids=out.2$mtc.ids, z.vars="labour5")
head(fused.2)
