RANDwNND.hotdeck: Random Distance hot deck.

Description

This function implements a variant of the distance hot deck method. For each recipient record, a subset of the closest donors is retained, and a donor is then selected at random from that subset.

Usage

RANDwNND.hotdeck(data.rec, data.don, match.vars=NULL, 
                 don.class=NULL, dist.fun="Manhattan", 
                 cut.don="rot", k=NULL, weight.don=NULL, 
                 keep.t=FALSE, ...)

Value

An R list with the following components:

mtc.ids: A matrix with the same number of rows as data.rec and two columns. The first column contains the row names of data.rec and the second column contains the row names of the corresponding donors selected from data.don. When the input matrices do not contain row names, a numeric matrix with the row indexes is provided.
sum.dist: A matrix with summary statistics concerning the subset of the closest donors. The first three columns report the minimum, the maximum and the standard deviation of the distances between the recipient record and the donors in the subset of the closest donors, respectively. The 4th column reports the cutting distance, i.e. the value of the distance such that donors at a higher distance are discarded. The 5th column reports the distance between the recipient and the donor chosen at random in the subset of the donors.
noad: For each recipient unit, reports the number of donor records in the subset of closest donors.
call: How the function has been called.

Arguments

data.rec

A numeric matrix or data frame that plays the role of recipient. This data frame must contain the variables (columns), specified via match.vars and don.class, that should be used in the matching.

Missing values (NA) are allowed.

data.don

A matrix or data frame that plays the role of donor. This data frame must contain the variables (columns), specified via match.vars and don.class, that should be used in the matching.

match.vars

A character vector with the names of the variables (the columns in both the data frames) that have to be used to compute distances between records (rows) in data.rec and those in data.don. When no matching variables are considered (match.vars=NULL) then all the units in the same donation class are considered as possible donors. Hence one of them is selected at random or with probability proportional to its weight (see argument weight.don). When match.vars=NULL and the donation classes are not created (don.class=NULL) then all the available records in the data.don are considered as potential donors.

don.class

A character vector with the names of the variables (columns in both the data frames) that have to be used to identify donation classes. In this case the computation of distances is limited to those units in data.rec and data.don that belong to the same donation class. The case of empty donation classes should be avoided. It is preferable that the variables used to form donation classes are defined as factors.

When not specified (default), no donation classes are used. This may result in a heavy computational effort.
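
As a rough illustration of what donation classes do when no matching variables are given, the base-R sketch below draws a donor at random within the recipient's class. The toy data frames rec and don and the helper pick_in_class are hypothetical names introduced here; this is not the package's internal code.

```r
# Toy data (hypothetical): a single classing variable, "sex"
rec <- data.frame(sex = factor(c("F", "M", "F")),
                  row.names = c("r1", "r2", "r3"))
don <- data.frame(sex = factor(c("F", "F", "M", "M", "M")),
                  row.names = c("d1", "d2", "d3", "d4", "d5"))

# For recipient i, draw one donor at random among the donors
# that fall in the same donation class
pick_in_class <- function(i, rec, don, class.var) {
  same <- as.character(don[[class.var]]) == as.character(rec[[class.var]][i])
  pool <- rownames(don)[same]
  if (length(pool) == 0) stop("empty donation class")
  pool[sample.int(length(pool), 1)]
}

set.seed(1)
don.id <- vapply(seq_len(nrow(rec)), pick_in_class, character(1),
                 rec = rec, don = don, class.var = "sex")

# Two-column matrix of recipient and donor identifiers
mtc <- cbind(rec.id = rownames(rec), don.id = don.id)
mtc
```

The explicit stop on an empty pool mirrors the advice above: a recipient whose class contains no donors cannot be matched.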

dist.fun

A string with the name of the distance function to be used. The following distances can be used: “Manhattan” (aka “City block”; default), “Euclidean”, “Mahalanobis”, “exact” or “exact matching”, “Gower”, “minimax”, “difference”, or one of the distance functions available in the package proxy. Note that the distances are computed using the function dist of the package proxy, with the exception of “Gower” (see function gower.dist for details), “Mahalanobis” (function mahalanobis.dist), “minimax” (see maximum.dist) and “difference”. Note that dist.fun="difference" computes just the difference between the values of the single numeric matching variable considered; in practice, it should be used when the subset of donors should be formed by comparing the values of that single matching variable (for further details see the argument cut.don).

By setting dist.fun="ANN" or dist.fun="RANN" it is possible to search for the k nearest neighbours of each recipient record by using the Approximate Nearest Neighbor (ANN) search as implemented in the function nn2 provided by the package RANN.

When dist.fun="Manhattan", "Euclidean", "Mahalanobis" or "minimax" all the variables in data.rec and data.don must be numeric. In contrast, when dist.fun="exact" or dist.fun="exact matching", all the variables in data.rec and data.don will be converted to character and, as far as the distance computation is concerned, they will be treated as categorical nominal variables, i.e. the distance is 0 if a pair of units shows the same response category and 1 otherwise.
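
To make two of these distances concrete, here is a small base-R sketch of the Manhattan distance on numeric variables and of the exact matching distance on a single nominal variable. The toy objects (x.rec, x.don, g.rec, g.don) and the helper manhattan are hypothetical; the package itself relies on proxy::dist and its own helpers, so this only illustrates the definitions.

```r
# Hypothetical toy matrices: rows are units, columns are matching variables
x.rec <- rbind(r1 = c(age = 30, inc = 10), r2 = c(age = 50, inc = 20))
x.don <- rbind(d1 = c(age = 28, inc = 12), d2 = c(age = 55, inc = 20),
               d3 = c(age = 30, inc = 10))

# Manhattan distance: sum of absolute differences over the variables
manhattan <- function(a, b) {
  out <- matrix(0, nrow(a), nrow(b),
                dimnames = list(rownames(a), rownames(b)))
  for (i in seq_len(nrow(a))) {
    out[i, ] <- rowSums(abs(sweep(b, 2, a[i, ])))
  }
  out
}
d.man <- manhattan(x.rec, x.don)
d.man

# Exact matching on one nominal variable: 0 if the categories agree, 1 otherwise
g.rec <- c(r1 = "north", r2 = "south")
g.don <- c(d1 = "north", d2 = "south", d3 = "north")
d.exact <- outer(as.character(g.rec), as.character(g.don), FUN = "!=") * 1
dimnames(d.exact) <- list(names(g.rec), names(g.don))
d.exact
```

Both results are recipient-by-donor matrices, so each row holds the distances from one recipient to every donor.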

cut.don

A character string that, jointly with the argument k, identifies the rule to be used to form the subset of the closest donor records.

cut.don="rot": (default) the number of closest donors to retain is given by \( \left[ \sqrt{n_{D}} \right]+1 \), where \( n_{D} \) is the total number of available donors. In this case k must not be specified.
cut.don="span": the number of closest donors is determined as the proportion k of all the available donors, i.e. \( \left[ n_{D} \times k \right] \). Note that, in this case, \( 0 < k \leq 1 \).
cut.don="exact": the k closest donors out of the \( n_{D} \) available are retained. In this case, \( 0 < k \leq n_{D} \).
cut.don="min": the donors at the minimum distance from the recipient are retained.
cut.don="k.dist": only the donors whose distance from the recipient is less than or equal to the value specified with the argument k are retained. Note that in this case it is not possible to use dist.fun="ANN".
cut.don="lt" or cut.don="<": only the donors whose value of the matching variable is smaller than the value of the recipient are retained. Note that in this case dist.fun="difference" must be set.
cut.don="le" or cut.don="<=": only the donors whose value of the matching variable is smaller than or equal to the value of the recipient are retained. Note that in this case dist.fun="difference" must be set.
cut.don="ge" or cut.don=">=": only the donors whose value of the matching variable is greater than or equal to the value of the recipient are retained. Note that in this case dist.fun="difference" must be set.
cut.don="gt" or cut.don=">": only the donors whose value of the matching variable is greater than the value of the recipient are retained. Note that in this case dist.fun="difference" must be set.
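
The distance-based rules can be sketched in base R as follows. The helper closest_subset is hypothetical and works on the vector of distances from one recipient to all donors. It reads the square brackets in the formulas as the integer part (floor) and breaks ties by position; both are assumptions, and the package may behave differently. The comparison rules ("lt", "le", "ge", "gt") are not covered.

```r
# Return the positions of the donors retained for one recipient,
# given the distances d to all available donors
closest_subset <- function(d, cut.don = "rot", k = NULL) {
  n.d <- length(d)
  if (cut.don == "min") return(which(d == min(d)))
  if (cut.don == "k.dist") return(which(d <= k))
  size <- switch(cut.don,
                 rot   = floor(sqrt(n.d)) + 1,   # assumed reading of [sqrt(n_D)] + 1
                 span  = floor(n.d * k),         # assumed reading of [n_D * k]
                 exact = k,
                 stop("rule not covered by this sketch"))
  size <- max(1, min(size, n.d))                 # keep at least one donor
  order(d)[seq_len(size)]
}

# Hypothetical distances from one recipient to nine donors
d <- c(d1 = 4, d2 = 35, d3 = 0, d4 = 7, d5 = 0,
       d6 = 12, d7 = 3, d8 = 20, d9 = 9)

closest_subset(d)                    # rot: floor(sqrt(9)) + 1 = 4 donors
closest_subset(d, "span", k = 0.5)   # floor(9 * 0.5) = 4 donors
closest_subset(d, "exact", k = 2)    # the 2 closest donors
closest_subset(d, "min")             # all donors at the minimum distance
closest_subset(d, "k.dist", k = 5)   # all donors within distance 5
```

Note that "min" and "k.dist" return a subset whose size depends on the data, whereas "rot", "span" and "exact" fix the size in advance.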

k

Depends on the cut.don argument.

weight.don

A character string providing the name of the variable with the weights associated to the donor units in data.don. When this variable is specified, then the selection of a donor among those in the subset of the closest donors is done with probability proportional to its weight (units with larger weight will have a higher chance of being selected). When weight.don=NULL (default) all the units in the subset of the closest donors will have the same probability of being selected.
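
A minimal base-R sketch of selection with probability proportional to weight, assuming a hypothetical subset of three closest donors with weights w. It uses sample() with the prob argument, which is one straightforward way to do such a draw and not necessarily how the package implements it.

```r
# Hypothetical subset of closest donors with their weights
w <- c(d1 = 10, d3 = 200, d7 = 40)

set.seed(123)
# One draw with probability proportional to the weights
chosen <- sample(names(w), size = 1, prob = w)
chosen

# Empirical check: over many draws, the selection frequencies
# should approach w / sum(w)
draws <- sample(names(w), size = 10000, replace = TRUE, prob = w)
freq <- vapply(names(w), function(id) mean(draws == id), numeric(1))
round(freq, 3)
round(w / sum(w), 3)
```

With weight.don=NULL the analogous draw would simply omit the prob argument, giving every donor in the subset the same chance.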

keep.t

Logical. When donation classes are used, setting keep.t=TRUE prints information on the donation classes being processed (by default keep.t=FALSE).

...

Additional arguments that may be required by gower.dist, by maximum.dist, by dist or by nn2.

Author

Marcello D'Orazio mdo.statmatch@gmail.com

Details

This function finds a donor record for each record in the recipient data set. The donor is chosen at random in the subset of available donors. This procedure is known as random hot deck (cf. Andridge and Little, 2010). In RANDwNND.hotdeck, the number of closest donors retained to form the subset is determined according to the criterion specified with the argument cut.don. The selection of the donor among those in the subset is carried out with equal probability (weight.don=NULL) or with probability proportional to a weight associated with the donors, specified via the weight.don argument. The latter procedure is known as weighted random hot deck (cf. Andridge and Little, 2010).
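
The whole procedure can be sketched end to end in a few lines of base R. The toy data, the single matching variable "age", the fixed subset size k and the output column names are all assumptions made for illustration; the function itself supports many more options.

```r
set.seed(42)
# Hypothetical toy data with one numeric matching variable
rec <- data.frame(age = c(23, 47, 61), row.names = paste0("r", 1:3))
don <- data.frame(age = c(20, 25, 30, 45, 50, 58, 60, 65, 70),
                  row.names = paste0("d", 1:9))
k <- 3  # number of closest donors to retain

res <- lapply(seq_len(nrow(rec)), function(i) {
  d    <- abs(don$age - rec$age[i])   # distance to every donor
  near <- order(d)[seq_len(k)]        # positions of the k closest donors
  pick <- near[sample.int(k, 1)]      # equal-probability draw in the subset
  data.frame(rec.id = rownames(rec)[i], don.id = rownames(don)[pick],
             min = min(d[near]), max = max(d[near]), dist.rd = d[pick])
})
out <- do.call(rbind, res)
out
```

Each row pairs a recipient with its randomly drawn donor, together with the smallest and largest distance in the retained subset and the distance of the donor actually chosen.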

The search for the subset of the closest donors can be sped up by using the Approximate Nearest Neighbor search as implemented in the function nn2 provided by the package RANN. Note that this search can be used in all cases with the exception of cut.don="k.dist".

Note that the same donor can be used more than once.
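
One simple way to check how often donors are reused is to tabulate the donor column of the matching output. The matrix below is a hypothetical stand-in for an mtc.ids-like result, with made-up identifiers and assumed column names.

```r
# Hypothetical matching output: recipients in column 1, donors in column 2
mtc.ids <- cbind(rec.id = c("r1", "r2", "r3", "r4", "r5"),
                 don.id = c("d2", "d7", "d2", "d4", "d2"))

use <- table(mtc.ids[, "don.id"])  # number of times each donor was selected
use[use > 1]                       # donors used more than once
max(use)                           # heaviest reuse of a single donor
```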

This function can also be used to impute missing values in a data set. In this case data.rec is the part of the initial data set that contains missing values; on the contrary, data.don is the part of the data set without missing values. See R code in the Examples for details.

References

Andridge, R.R., and Little, R.J.A. (2010). “A Review of Hot Deck Imputation for Survey Non-response”. International Statistical Review, 78, 40--64.

D'Orazio, M., Di Zio, M. and Scanu, M. (2006). Statistical Matching: Theory and Practice. Wiley, Chichester.

Rodgers, W.L. (1984). “An evaluation of statistical matching”. Journal of Business and Economic Statistics, 2, 91--102.

Singh, A.C., Mantel, H., Kinack, M. and Rowe, G. (1993). “Statistical matching: use of auxiliary information as an alternative to the conditional independence assumption”. Survey Methodology, 19, 59--79.

Examples



data(samp.A, samp.B, package="StatMatch") #loads data sets
?samp.A
?samp.B


# samp.A plays the role of recipient
# samp.B plays the role of donor
# find a donor in the same region ("area5") and with the same
# gender ("sex"), then only the closest k=20 donors in terms of
# "age" are considered and one of them is picked at random

out.RND.1 <- RANDwNND.hotdeck(data.rec=samp.A, data.don=samp.B,
                              don.class=c("area5", "sex"), dist.fun="ANN",
                              match.vars="age", cut.don="exact", k=20)

# create the synthetic (or fused) data.frame:
# fill in "labour5" in A
fused.1 <- create.fused(data.rec=samp.A, data.don=samp.B,
                        mtc.ids=out.RND.1$mtc.ids, z.vars="labour5")
head(fused.1)

# weights ("ww") are used in selecting the donor in the final step

out.RND.2 <- RANDwNND.hotdeck(data.rec=samp.A, data.don=samp.B,
                              don.class=c("area5", "sex"), dist.fun="ANN",
                              match.vars="age", cut.don="exact", 
                              k=20, weight.don="ww")
fused.2 <- create.fused(data.rec=samp.A, data.don=samp.B,
                        mtc.ids=out.RND.2$mtc.ids, z.vars="labour5")
head(fused.2)

# find a donor in the same region ("area5") and with the same
# gender ("sex"), then only the donors with "age" <= the age of the
# recipient are considered,
# then one of them is picked at random

out.RND.3 <- RANDwNND.hotdeck(data.rec=samp.A, data.don=samp.B,
                              don.class=c("area5", "sex"), dist.fun="diff",
                              match.vars="age", cut.don="<=")

# create the synthetic (or fused) data.frame:
# fill in "labour5" in A
fused.3 <- create.fused(data.rec=samp.A, data.don=samp.B,
                        mtc.ids=out.RND.3$mtc.ids, z.vars="labour5")
head(fused.3)

# Example of Imputation of missing values
# introducing missing values in iris
ir.mat <- iris
miss <- rbinom(nrow(iris), 1, 0.3)
ir.mat[miss==1,"Sepal.Length"] <- NA
iris.rec <- ir.mat[miss==1,-1]
iris.don <- ir.mat[miss==0,]

#search for NND donors
imp.RND <- RANDwNND.hotdeck(data.rec=iris.rec, data.don=iris.don,
                            match.vars=c("Sepal.Width","Petal.Length", "Petal.Width"),
                            don.class="Species")

# imputing missing values
iris.rec.imp <- create.fused(data.rec=iris.rec, data.don=iris.don,
                             mtc.ids=imp.RND$mtc.ids, z.vars="Sepal.Length")

# rebuild the imputed data.frame
final <- rbind(iris.rec.imp, iris.don)
head(final)
