rankNND.hotdeck: Rank distance hot deck method.

Description

This function implements the rank distance hot deck method. For each recipient record, the closest donor is chosen by considering the distance between the percentage points of the empirical cumulative distribution function.

Usage

rankNND.hotdeck(data.rec, data.don, var.rec, var.don=var.rec, 
                 don.class=NULL,  weight.rec=NULL, weight.don=NULL,
                 constrained=FALSE, constr.alg="Hungarian")

Arguments

data.rec

A numeric matrix or data frame that plays the role of recipient. This data frame must contain the variable var.rec to be used in computing the percentage points of the empirical cumulative distribution function and eventually the va

data.don

A matrix or data frame that plays the role of donor. This data frame must contain the variable var.don to be used in computing percentage points of the empirical cumulative distribution function and eventually the variables that

var.rec

A character vector with the name of the variable in data.rec that should be ranked.

var.don

A character vector with the name of the variable in data.don that should be ranked. If not specified, var.don=var.rec by default.

don.class

A character vector with the names of the variables (columns in both the data frames) that have to be used to identify donation classes. In each donation class the computation of percentage points is carried out independently. Then only distances among p

weight.rec

Optional name of the variable in data.rec that provides the weights to be used in computing the empirical cumulative distribution function for var.rec (see Details).

weight.don

Optional name of the variable in data.don that provides the weights to be used in computing the empirical cumulative distribution function for var.don (see Details).

constrained

Logical. When constrained=FALSE (default) each record in data.don can be used as a donor more than once. On the contrary, when constrained=TRUE each record in data.don can be used as a donor only once

constr.alg

A string that has to be specified when constrained=TRUE. Two choices are available: lpSolve and Hungarian. In the first case, constr.alg="lpSolve", the transportation problem is solved by means

Value

An R list with the following components:

mtc.ids

A matrix with as many rows as data.rec and two columns. The first column contains the row names of data.rec and the second column contains the row names of the corresponding donors selected from data.don. When the input matrices do not contain row names, a numeric matrix with the row indexes is provided.

dist.rd

A vector with the distances between each recipient unit and the corresponding donor.

noad

The number of available donors at the minimum distance for each recipient unit (only in the unconstrained case).

call

How the function has been called.

Details

This function finds a donor record for each record in the recipient data set. The chosen donor is the one at the closest distance in terms of the empirical cumulative distribution (Singh et al., 1993). In practice, the distance is computed from the estimated empirical cumulative distribution function of the reference variable (var.rec in data.rec and var.don in data.don). The empirical cumulative distribution function is estimated by:

$$\hat{F}(y) = \frac{1}{n} \sum_{i=1}^{n} I(y_i\leq y)$$

where $I(y_i\leq y)=1$ if $y_i\leq y$ and 0 otherwise.

In the presence of weights the empirical cumulative distribution function is estimated by:

$$\hat{F}(y) = \frac{\sum_{i=1}^{n} w_i I(y_i\leq y)}{\sum_{i=1}^{n} w_i}$$
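The two estimators above can be sketched in base R as follows. The helper name `wecdf` and the toy values are assumptions made for illustration; this is not the package's internal code, only a direct transcription of the formulas, evaluated at the observed values of `y`.

```r
# Hypothetical helper (not part of StatMatch): weighted empirical CDF of y,
# evaluated at each observed value of y. With unit weights it reduces to
# the unweighted estimator.
wecdf <- function(y, w = rep(1, length(y))) {
  # For each y[j], sum the weights of all observations with y_i <= y[j]
  sapply(y, function(v) sum(w[y <= v])) / sum(w)
}

y <- c(10, 20, 20, 40)   # toy values (assumed)
w <- c(1, 1, 2, 4)       # toy weights (assumed)

F.unw <- wecdf(y)        # unweighted percentage points, same as ecdf(y)(y)
F.w   <- wecdf(y, w)     # weighted percentage points

print(F.unw)
print(F.w)
```

With weights, an observation carrying a large weight moves the percentage points of all values at or above it, which is why the weighted and unweighted results differ for the same `y`.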

In the unconstrained case, when several donors are at the same minimum distance from a recipient, one of them is chosen at random.
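A minimal base R sketch of the unconstrained search with random tie-breaking is given below. It assumes the distance is the absolute difference between percentage points; the vectors `F.rec` and `F.don` hold made-up percentage points, not package output, and the object names `noad` and `dist.rd` simply mirror the components described under Value.

```r
# Illustrative percentage points (assumed values, exactly representable
# in binary so that ties can be detected with ==)
F.rec <- c(r1 = 0.25, r2 = 0.5, r3 = 0.875)
F.don <- c(d1 = 0.125, d2 = 0.375, d3 = 0.75, d4 = 1)

# Recipient-by-donor matrix of absolute differences
dist.mat <- abs(outer(F.rec, F.don, "-"))

# Number of donors at the minimum distance for each recipient
noad <- apply(dist.mat, 1, function(d) sum(d == min(d)))

set.seed(1)  # only to make the random tie-breaking reproducible
pick <- apply(dist.mat, 1, function(d) {
  ties <- which(d == min(d))                       # donors at minimum distance
  if (length(ties) > 1) sample(ties, 1) else ties  # random choice among ties
})

# Distance between each recipient and its selected donor
dist.rd <- dist.mat[cbind(seq_along(F.rec), pick)]

print(noad)
print(pick)
print(dist.rd)
```

The explicit `length(ties) > 1` check avoids the behaviour of `sample()` on a single number, which would otherwise sample from `1:ties`.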

Note that when donation classes are introduced, the empirical cumulative distribution function is estimated independently in each donation class, and the search for a donor is restricted to donors in the same donation class as the recipient.
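The effect of donation classes can be sketched in base R as follows. The data frames `rec` and `don` and the column names `cls`, `x` and `F` are assumptions for illustration; ties are resolved by taking the first minimum rather than at random, to keep the sketch short.

```r
# Toy recipient and donor data (assumed), with a class variable cls
rec <- data.frame(cls = c("a", "a", "b"), x = c(1, 3, 10),
                  stringsAsFactors = FALSE)
don <- data.frame(cls = c("a", "a", "b", "b"), x = c(2, 4, 5, 50),
                  stringsAsFactors = FALSE)

# Percentage points computed separately within each class
rec$F <- ave(rec$x, rec$cls, FUN = function(v) ecdf(v)(v))
don$F <- ave(don$x, don$cls, FUN = function(v) ecdf(v)(v))

# For each recipient, search only among donors of the same class
match.id <- sapply(seq_len(nrow(rec)), function(i) {
  cand <- which(don$cls == rec$cls[i])
  cand[which.min(abs(don$F[cand] - rec$F[i]))]  # first minimum, no randomization
})

print(cbind(rec = seq_len(nrow(rec)), don = match.id))
```

In this toy example the third recipient (x = 10, the largest value in class "b") is matched to the donor with x = 50 rather than x = 5, because the comparison is made on percentage points within the class, not on the raw values.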

By default a donor can be chosen more than once. To avoid this, set constrained=TRUE. In that case each donor can be chosen just once, and the donors are selected by solving a transportation problem whose objective is to minimize the overall matching distance (the sum of the recipient-donor distances).
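To illustrate the difference between the two settings, the sketch below solves a tiny constrained problem by enumerating every assignment of distinct donors. This brute-force search is a stand-in chosen for readability only; it is not how the function solves the transportation problem, and the distance values are made up.

```r
# Illustrative recipient-by-donor distance matrix (assumed values)
dist.mat <- rbind(r1 = c(d1 = 0.125, d2 = 0.25, d3 = 0.75),
                  r2 = c(d1 = 0,     d2 = 0.5,  d3 = 0.75))

# Unconstrained: each recipient takes its nearest donor independently,
# so the same donor may be selected twice
unconstr <- apply(dist.mat, 1, which.min)
total.unconstr <- sum(dist.mat[cbind(1:2, unconstr)])

# Constrained: enumerate all pairs of distinct donors (brute force)
cand <- expand.grid(r1 = 1:3, r2 = 1:3)
cand <- cand[cand$r1 != cand$r2, ]

# Overall matching distance of each candidate assignment
total <- dist.mat[cbind(1, cand$r1)] + dist.mat[cbind(2, cand$r2)]
best <- cand[which.min(total), ]

print(unconstr)         # both recipients select donor 1
print(best)             # donors assigned when each can be used only once
print(c(unconstrained = total.unconstr, constrained = min(total)))
```

The constrained total can never be smaller than the unconstrained one, since the constraint only removes candidate assignments. Enumeration grows factorially with the number of records, which is why a dedicated solver is needed for real data.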

References

D'Orazio, M., Di Zio, M. and Scanu, M. (2006). Statistical Matching: Theory and Practice. Wiley, Chichester.

Singh, A.C., Mantel, H., Kinack, M. and Rowe, G. (1993). Statistical matching: use of auxiliary information as an alternative to the conditional independence assumption. Survey Methodology, 19, 59-79.

Examples

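library(StatMatch) # provides rankNND.hotdeck() and create.fused()
library(SDaA)      # provides the agpop, agsrs and agstrat data sets
data(agpop, agsrs, agstrat, package="SDaA") # loads the ag datasets from SDaA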
str(agpop)
str(agsrs)
str(agstrat)

agsrs$w.srs <- nrow(agpop)/nrow(agsrs) # add weights

# adds region to agsrs
state.region <- data.frame(xtabs(weight~state+region, data=agstrat))
state.region <- subset(state.region, Freq>0)
agsrs <- merge(agsrs, state.region[,1:2], by="state", all.x=TRUE)

# simulate statistical matching framework
A <- agsrs[, c("region", "acres82", "acres87", "w.srs")]
B <- agstrat[, c("region", "acres82", "acres92", "weight")]

# simplest call to rankNND.hotdeck()
# UNCONSTRAINED case
out.1 <- rankNND.hotdeck(data.rec=A, data.don=B, var.rec="acres82")
fused.1 <- create.fused(data.rec=A, data.don=B,
                        mtc.ids=out.1$mtc.ids, z.vars="acres92")
head(fused.1)

#  call to rankNND.hotdeck() with usage of weights
# UNCONSTRAINED case
out.2 <- rankNND.hotdeck(data.rec=A, data.don=B, var.rec="acres82",
                         weight.rec="w.srs", weight.don="weight")
fused.2 <- create.fused(data.rec=A, data.don=B,
                        mtc.ids=out.2$mtc.ids, z.vars="acres92")
head(fused.2)

# call to rankNND.hotdeck() with usage of weights and donation classes
# UNCONSTRAINED case
out.3 <- rankNND.hotdeck(data.rec=A, data.don=B, var.rec="acres82",
                         don.class="region", weight.rec="w.srs", weight.don="weight")
fused.3 <- create.fused(data.rec=A, data.don=B,
                        mtc.ids=out.3$mtc.ids, z.vars="acres92")
head(fused.3)

# call to rankNND.hotdeck()
# CONSTRAINED case
out.1c <- rankNND.hotdeck(data.rec=A, data.don=B, var.rec="acres82",
                         constrained=TRUE, constr.alg="Hungarian")
fused.1c <- create.fused(data.rec=A, data.don=B,
                        mtc.ids=out.1c$mtc.ids, z.vars="acres92")
head(fused.1c)

#  call to rankNND.hotdeck() with usage of weights
# CONSTRAINED case
out.2c <- rankNND.hotdeck(data.rec=A, data.don=B, var.rec="acres82",
                          weight.rec="w.srs", weight.don="weight",
                          constrained=TRUE, constr.alg="Hungarian")
fused.2c <- create.fused(data.rec=A, data.don=B,
                        mtc.ids=out.2c$mtc.ids, z.vars="acres92")
head(fused.2c)
