mixed.mtc: Statistical Matching via Mixed Methods

Description

This function implements some mixed methods to perform statistical matching between two data sources.

Usage

mixed.mtc(data.rec, data.don, match.vars, y.rec, z.don, method="ML",
           rho.yz=NULL, micro=FALSE, constr.alg="Hungarian")

Value

A list with a varying number of components depending on the values of the arguments method and rho.yz.

mu: The estimated mean vector.
vc: The estimated variance--covariance matrix.
cor: The estimated correlation matrix.
res.var: A vector with estimates of the residual variances $\sigma_{Y|Z\bf{X}}$ and $\sigma_{Z|Y\bf{X}}$.
start.prho.yz: The initial guess for the partial correlation coefficient $\rho_{YZ|\bf{X}}$ supplied through the rho.yz argument when method="ML".
rho.yz: Returned only when method="MS". A vector with four values: the initial guess for $\rho_{YZ}$; the lower and upper bounds for $\hat{\rho}_{YZ}$ in the statistical matching framework, given the correlation coefficients between Y and X and between Z and X estimated from the available data; and, finally, the closest admissible value, used in the computations in place of the initial rho.yz when the latter is not coherent with the other correlation coefficients estimated from the available data.
phi: Returned only when method="MS". Estimates of the $\phi$ terms introduced by Moriarity and Scheuren (2001 and 2003).
filled.rec: The data.rec filled in with the values of Z. Returned only when micro=TRUE.
mtc.ids: Returned only when micro=TRUE. A matrix with the same number of rows as data.rec and two columns. The first column contains the row names of data.rec and the second column contains the row names of the corresponding donors selected from data.don. When the input matrices do not contain row names, a numeric matrix with the row indexes is provided.
dist.rd: A vector with the distances between each recipient unit and the corresponding donor, returned only in case micro=TRUE.
call: How the function has been called.

Arguments

data.rec

A matrix or data frame that plays the role of recipient in the statistical matching application. This data set must contain all variables (columns) that should be used in statistical matching, i.e. the variables called by the arguments match.vars and y.rec. Continuous variables are expected; any categorical variables are recoded into dummies. Missing values (NA) are not allowed.

data.don

A matrix or data frame that plays the role of donor in the statistical matching application. This data set must contain all the numeric variables (columns) that should be used in statistical matching, i.e. the variables called by the arguments match.vars and z.don. Continuous variables are expected; any categorical variables are recoded into dummies. Missing values (NA) are not allowed.

match.vars

A character vector with the names of the common variables (the columns in both the data frames) to be used as matching variables (X).

y.rec

A character vector with the name of the target variable Y that is observed only for units in data.rec. Only one continuous variable is allowed.

z.don

A character vector with the name of the target variable Z that is observed only for units in data.don. Only one continuous variable is allowed.

method

A character vector that identifies the method used to estimate the parameters of the regression models Y vs. X and Z vs. X. The Maximum Likelihood method is used when method="ML" (default); when method="MS", the parameters are estimated according to the approach proposed by Moriarity and Scheuren (2001 and 2003). See Details for further information.

rho.yz

A numeric value representing a guess for the correlation between Y (y.rec) and Z (z.don), which are never jointly observed. When method="MS", rho.yz must specify the correlation coefficient $\rho_{YZ}$; when method="ML", it must specify the partial correlation coefficient between Y and Z given X ($\rho_{YZ|\bf{X}}$).

By default rho.yz=NULL. In practice, in the absence of auxiliary information on the correlation coefficient or the partial correlation coefficient, the statistical matching is carried out under the assumption of independence between Y and Z given X (Conditional Independence Assumption, CIA), i.e. $\rho_{YZ|\bf{X}}=0$.
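As a reminder of how the two quantities accepted by rho.yz relate, the standard definition of partial correlation can be rearranged as follows. The notation is introduced here for illustration and is not taken from the package: $\boldsymbol{\rho}_{Y\bf{X}}$ and $\boldsymbol{\rho}_{Z\bf{X}}$ are the vectors of correlations of Y and Z with the matching variables, and $\mathbf{R}_{\bf{XX}}$ is the correlation matrix of X.

$$ \rho_{YZ} = \boldsymbol{\rho}_{Y\bf{X}}^{\prime} \mathbf{R}_{\bf{XX}}^{-1} \boldsymbol{\rho}_{Z\bf{X}} + \rho_{YZ|\bf{X}} \sqrt{ \left( 1 - \boldsymbol{\rho}_{Y\bf{X}}^{\prime} \mathbf{R}_{\bf{XX}}^{-1} \boldsymbol{\rho}_{Y\bf{X}} \right) \left( 1 - \boldsymbol{\rho}_{Z\bf{X}}^{\prime} \mathbf{R}_{\bf{XX}}^{-1} \boldsymbol{\rho}_{Z\bf{X}} \right) } $$

Under the CIA the second term vanishes, so $\rho_{YZ}$ is determined entirely by the correlations with X.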

micro

Logical. When micro=FALSE (default) only the parameter estimates are returned. When micro=TRUE the function also returns data.rec filled in with the values for the variable Z. The donors for filling in Z in data.rec are identified using a constrained distance hot deck method. In this case, the number of units (rows) in data.don must be greater than or equal to the number of units (rows) in data.rec. See the next argument and Details for further information.

constr.alg

A string that has to be specified when micro=TRUE, in order to solve the transportation problem arising in the constrained distance hot deck method. Two choices are available: “lpSolve” and “Hungarian”. When constr.alg="lpSolve", the transportation problem is solved by means of the function lp.transport available in the package lpSolve. When constr.alg="Hungarian" (default), the transportation problem is solved using the Hungarian method implemented in the function solve_LSAP available in the package clue (Hornik, 2012). Note that the Hungarian algorithm is more efficient and requires less processing time.
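To make the idea of a constrained match concrete, the following minimal sketch contrasts unconstrained nearest neighbor selection with a constrained assignment on a tiny, made-up distance matrix. It uses brute-force enumeration in base R as a stand-in; it is not the Hungarian method or lp.transport, and the object names (d, cand, best) are hypothetical.

```r
# Hypothetical 3 x 4 distance matrix: rows = recipients, columns = donors
d <- matrix(c(1, 2, 9, 8,
              1, 5, 7, 9,
              6, 2, 3, 4), nrow = 3, byrow = TRUE)

# Unconstrained nearest neighbor: the same donor may be picked more than once
unconstrained <- apply(d, 1, which.min)

# Constrained version: each donor is used at most once and the total
# distance is minimized. Brute force over all candidate assignments,
# feasible only for very small problems (illustration only).
cand <- as.matrix(expand.grid(1:4, 1:4, 1:4))
cand <- cand[apply(cand, 1, anyDuplicated) == 0, , drop = FALSE]
cost <- apply(cand, 1, function(ids) sum(d[cbind(1:3, ids)]))
best <- unname(cand[which.min(cost), ])

unconstrained   # donor 1 is selected for both recipient 1 and recipient 2
best            # donors assigned to recipients 1, 2, 3 without reuse
min(cost)       # total distance of the constrained solution
```

The constrained solution has a larger total distance than the unconstrained one, which is the price paid for not reusing donors. For realistic file sizes a dedicated solver such as those named above is needed, since enumeration grows factorially.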

Author

Marcello D'Orazio mdo.statmatch@gmail.com

Details

This function implements some mixed methods to perform statistical matching. A mixed method consists of two steps:

(1) adoption of a parametric model for the joint distribution of $ \left( \mathbf{X},Y,Z \right) $ and estimation of its parameters;

(2) derivation of a complete “synthetic” data set (recipient data set filled in with values for the Z variable) using a nonparametric approach.

In this case, as far as (1) is concerned, it is assumed that $ \left( \mathbf{X},Y,Z \right) $ follows a multivariate normal distribution. Please note that if some of the X are categorical, then they are recoded into dummies before starting with the estimation. In such a case, the assumption of multivariate normal distribution may be questionable.

The whole procedure is based on the imputation method known as predictive mean matching. The procedure consists of three steps:

step 1a) Regression step: the two linear regression models Y vs. X and Z vs. X are considered and their parameters are estimated.

step 1b) Computation of intermediate values. For the units in data.rec the following intermediate values are derived:

$$ \tilde{z}_{a} = \hat{\alpha}_{Z} + \hat{\beta}_{Z\bf{X}} \mathbf{x}_a + e_a $$

for each $a=1,\ldots,n_{A}$, where $n_A$ is the number of units in data.rec (rows of data.rec). Note that $e_a$ is a random draw from the normal distribution with zero mean and estimated residual variance $\hat{\sigma}_{Z|\bf{X}}$.

Similarly, for the units in data.don the following intermediate values are derived:

$$ \tilde{y}_{b} = \hat{\alpha}_{Y} + \hat{\beta}_{Y\bf{X}} \mathbf{x}_b + e_b $$

for each $b=1,\ldots,n_{B}$, where $n_B$ is the number of units in data.don (rows of data.don). Here $e_b$ is a random draw from the normal distribution with zero mean and estimated residual variance $\hat{\sigma}_{Y|\bf{X}}$.

step 2) Matching step. For each observation (row) in data.rec a donor is chosen in data.don through a nearest neighbor constrained distance hot deck procedure. The distances are computed between $\left( y_a, \tilde{z}_a \right)$ and $\left( \tilde{y}_b, z_b \right)$ using Mahalanobis distance.
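The steps above can be sketched in base R as follows. This is a simplified illustration under the CIA, not the internal code of mixed.mtc: the data are simulated, each regression is fitted by least squares on a single file, and the covariance matrix used in the Mahalanobis distance is estimated from the pooled points, all of which are assumptions made here for brevity.

```r
set.seed(1)
n_A <- 40; n_B <- 60

# Hypothetical files: A observes (x, y), B observes (x, z)
A <- data.frame(x = rnorm(n_A)); A$y <- 1 + 2 * A$x + rnorm(n_A, sd = 0.5)
B <- data.frame(x = rnorm(n_B)); B$z <- -1 + 0.5 * B$x + rnorm(n_B, sd = 0.3)

# Step 1a: regression of Y on X (file A) and of Z on X (file B)
fit_y <- lm(y ~ x, data = A)
fit_z <- lm(z ~ x, data = B)
sd_y <- summary(fit_y)$sigma   # residual standard deviation of Y given X
sd_z <- summary(fit_z)$sigma   # residual standard deviation of Z given X

# Step 1b: intermediate values = predicted value + random normal residual
z_tilde <- predict(fit_z, newdata = A) + rnorm(n_A, mean = 0, sd = sd_z)
y_tilde <- predict(fit_y, newdata = B) + rnorm(n_B, mean = 0, sd = sd_y)

# Step 2 (distance part): Mahalanobis distances between
# (y_a, z~_a) for recipients and (y~_b, z_b) for donors
rec <- cbind(y = A$y, z = z_tilde)
don <- cbind(y = y_tilde, z = B$z)
S <- cov(rbind(rec, don))      # assumption: pooled covariance estimate

# stats::mahalanobis returns squared distances, hence the square root
d2 <- t(apply(rec, 1, function(r) mahalanobis(don, center = r, cov = S)))
dist_rd <- sqrt(pmax(d2, 0))   # n_A x n_B matrix of distances

# Unconstrained nearest donor for each recipient (the function itself
# applies a constrained selection instead)
nn <- apply(dist_rd, 1, which.min)
head(nn)
```

The matrix dist_rd is the kind of input a constrained assignment solver would take to pick one distinct donor per recipient.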

For further details see Sections 2.5.1 and 3.6.1 in D'Orazio et al. (2006).

In step 1a) the parameters of the regression model can be estimated by means of the Maximum Likelihood method (method="ML") (see D'Orazio et al., 2006, pp. 19--23, 73--75) or using the Moriarity and Scheuren (2001 and 2003) approach (method="MS") (see also D'Orazio et al., 2006, pp. 75--76). The two estimation methods are compared in D'Orazio et al. (2005).

When method="MS", if the value specified for the argument rho.yz is not compatible with the other correlation coefficients estimated from the data, then it is substituted with the closest value compatible with the other estimated coefficients.
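A rough sketch of such a compatibility check is given below. It relies on the general requirement that the full correlation matrix of (X, Y, Z) be positive semidefinite, which yields an interval of admissible values for $\rho_{YZ}$. The correlation values are made up and the helper closest_admissible is hypothetical; the exact computation inside mixed.mtc may differ.

```r
# Illustrative correlations, as if estimated from the two separate files
R_xx <- matrix(c(1, 0.3,
                 0.3, 1), nrow = 2, byrow = TRUE)  # among two X variables
r_yx <- c(0.6, 0.4)                                # cor(Y, X), from data.rec
r_zx <- c(0.5, 0.2)                                # cor(Z, X), from data.don

R_inv  <- solve(R_xx)
centre <- drop(t(r_yx) %*% R_inv %*% r_zx)         # value implied by the CIA
half   <- sqrt((1 - drop(t(r_yx) %*% R_inv %*% r_yx)) *
               (1 - drop(t(r_zx) %*% R_inv %*% r_zx)))
bounds <- c(low = centre - half, up = centre + half)

# Hypothetical helper: replace an incompatible guess with the closest
# value inside the admissible interval
closest_admissible <- function(rho, bounds) {
  min(max(rho, bounds[["low"]]), bounds[["up"]])
}

bounds
closest_admissible(0.99, bounds)   # outside the interval: moved to the upper bound
closest_admissible(0.30, bounds)   # inside the interval: left unchanged
```

The centre of the interval is the value of $\rho_{YZ}$ implied by the CIA, and the two end points correspond to a partial correlation of -1 and +1.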

When micro=FALSE only the estimation of the parameters is performed (step 1a). Otherwise (micro=TRUE) the whole procedure is carried out.

References

D'Orazio, M., Di Zio, M. and Scanu, M. (2005). “A comparison among different estimators of regression parameters on statistically matched files through an extensive simulation study”, Contributi, 2005/10, Istituto Nazionale di Statistica, Rome.

D'Orazio, M., Di Zio, M. and Scanu, M. (2006). Statistical Matching: Theory and Practice. Wiley, Chichester.

Hornik, K. (2012). clue: Cluster ensembles. R package version 0.3-45. https://CRAN.R-project.org/package=clue.

Moriarity, C., and Scheuren, F. (2001). “Statistical matching: a paradigm for assessing the uncertainty in the procedure”. Journal of Official Statistics, 17, 407--422.

Moriarity, C., and Scheuren, F. (2003). “A note on Rubin's statistical matching using file concatenation with adjusted weights and multiple imputation”, Journal of Business and Economic Statistics, 21, 65--73.

Examples


# reproduce the statistical matching framework
# starting from the iris data.frame
suppressWarnings(RNGversion("3.5.0"))
set.seed(98765)
pos <- sample(1:150, 50, replace=FALSE)
ir.A <- iris[pos,c(1,3:5)]
ir.B <- iris[-pos, 2:5]

xx <- intersect(colnames(ir.A), colnames(ir.B))
xx  # common variables

# ML estimation method under CIA (rho_YZ|X = 0);
# only parameter estimates (micro=FALSE)
# only continuous matching variables
xx.mtc <- c("Petal.Length", "Petal.Width")
mtc.1 <- mixed.mtc(data.rec=ir.A, data.don=ir.B, match.vars=xx.mtc,
                    y.rec="Sepal.Length", z.don="Sepal.Width")

# estimated correlation matrix
mtc.1$cor 

# ML estimation method under CIA (rho_YZ|X = 0);
# only parameter estimates (micro=FALSE)
# categorical variable 'Species' used as matching variable

xx.mtc <- xx
mtc.2 <- mixed.mtc(data.rec=ir.A, data.don=ir.B, match.vars=xx.mtc,
                    y.rec="Sepal.Length", z.don="Sepal.Width")

# estimated correlation matrix
mtc.2$cor 


# ML estimation method with partial correlation coefficient
# set equal to 0.5 (rho_YZ|X=0.5)
# only parameter estimates (micro=FALSE)

mtc.3 <- mixed.mtc(data.rec=ir.A, data.don=ir.B, match.vars=xx.mtc,
                    y.rec="Sepal.Length", z.don="Sepal.Width",
                    rho.yz=0.5)

# estimated correlation matrix
mtc.3$cor 

# ML estimation method with partial correlation coefficient
# set equal to 0.5 (rho_YZ|X=0.5)
# with imputation step (micro=TRUE)

mtc.4 <- mixed.mtc(data.rec=ir.A, data.don=ir.B, match.vars=xx.mtc,
                    y.rec="Sepal.Length", z.don="Sepal.Width",
                    rho.yz=0.5, micro=TRUE, constr.alg="Hungarian")

# first rows of data.rec filled in with z
head(mtc.4$filled.rec)

#
# Moriarity and Scheuren estimation method under CIA;
# only with parameter estimates (micro=FALSE)
mtc.5 <- mixed.mtc(data.rec=ir.A, data.don=ir.B, match.vars=xx.mtc,
                    y.rec="Sepal.Length", z.don="Sepal.Width",
                    method="MS")

# the starting value of rho.yz and the value used
# in computations
mtc.5$rho.yz

# estimated correlation matrix
mtc.5$cor 

# Moriarity and Scheuren estimation method
# with correlation coefficient set equal to -0.15 (rho_YZ=-0.15)
# with imputation step (micro=TRUE)

mtc.6 <- mixed.mtc(data.rec=ir.A, data.don=ir.B, match.vars=xx.mtc,
                    y.rec="Sepal.Length", z.don="Sepal.Width",
                    method="MS", rho.yz=-0.15, 
                    micro=TRUE, constr.alg="lpSolve")

# the starting value of rho.yz and the value used
# in computations
mtc.6$rho.yz

# estimated correlation matrix
mtc.6$cor

# first rows of data.rec filled in with z imputed values
head(mtc.6$filled.rec)
