
StatMatch (version 1.2.0)

mixed.mtc: Statistical Matching via Mixed Methods

Description

This function implements some mixed methods to perform statistical matching between two data sources.

Usage

mixed.mtc(data.rec, data.don, match.vars, y.rec, z.don, method="ML",
           rho.yz=0, micro=FALSE, constr.alg="Hungarian")

Arguments

data.rec
A matrix or data frame that plays the role of recipient in the statistical matching application. This data set must contain all variables (columns) that should be used in statistical matching, i.e. the variables called by the arguments mat
data.don
A matrix or data frame that plays the role of donor in the statistical matching application. This data set must contain all the numeric variables (columns) that should be used in statistical matching, i.e. the variables called by the arguments
match.vars
A character vector with the names of the common variables (the columns in both the data frames) to be used as matching variables (X).
y.rec
A character vector with the name of the target variable Y that is observed only for units in data.rec. Only one continuous variable is allowed.
z.don
A character vector with the name of the target variable Z that is observed only for units in data.don. Only one continuous variable is allowed.
method
A character vector that identifies the method that should be used to estimate the parameters of the regression models: Y vs. X and Z vs. X. Maximum Likelihood method is used when method="ML" (default); on the contrary, when
rho.yz
A numeric value representing a guess for the correlation between the Y (y.rec) and the Z (z.don) variables, which are not jointly observed. When method="MS" then the argument rho.yz must specify the value of
micro
Logical. When micro=FALSE (default) only the parameter estimates are returned. When micro=TRUE the function also returns data.rec filled in with the values for the variable Z. The donors for filli
constr.alg
A string that has to be specified when micro=TRUE, in order to solve the transportation problem involved by the constrained distance hot deck method. Two choices are available: lpSolve and Hungarian. In the

Value

A list with a varying number of components, depending on the values of the arguments method and rho.yz:

  • mu: The estimated mean vector.
  • vc: The estimated variance--covariance matrix.
  • cor: The estimated correlation matrix.
  • res.var: A vector with estimates of the residual variances $\sigma_{Y|Z\bf{X}}$ and $\sigma_{Z|Y\bf{X}}$.
  • start.prho.yz: The initial guess for the partial correlation coefficient $\rho_{YZ|\bf{X}}$ passed in input via the rho.yz argument when method="ML".
  • rho.yz: Returned only when method="MS". A vector with four values: the initial guess for $\rho_{YZ}$; the lower and upper bounds for $\hat{\rho}_{YZ}$ in the statistical matching framework, given the correlation coefficients between Y and X and between Z and X estimated from the available data; and, finally, the closest admissible value used in computations in place of the initial rho.yz when the latter is not coherent with the other correlation coefficients estimated from the available data.
  • phi: Returned only when method="MS". Estimates of the $\phi$ terms introduced by Moriarity and Scheuren (2001 and 2003).
  • filled.rec: The data.rec filled in with the values of Z. Returned only when micro=TRUE.
  • mtc.ids: Returned only when micro=TRUE. A matrix with the same number of rows as data.rec and two columns. The first column contains the row names of data.rec and the second column contains the row names of the corresponding donors selected from data.don. When the input matrices do not contain row names, a numeric matrix with the row indexes is provided.
  • dist.rd: A vector with the distances between each recipient unit and the corresponding donor. Returned only when micro=TRUE.
  • call: How the function has been called.

Details

This function implements some mixed methods to perform statistical matching. A mixed method consists of two steps:

(1) adoption of a parametric model for the joint distribution of $\left( \mathbf{X},Y,Z \right)$ and estimation of its parameters;

(2) derivation of a complete synthetic data set (recipient data set filled in with values for the Z variable) using a nonparametric approach.

In this case, as far as (1) is concerned, it is assumed that $\left( \mathbf{X},Y,Z \right)$ follows a multivariate normal distribution. Please note that if some of the X are categorical, then they are recoded into dummies before starting with the estimation. In such a case the assumption of multivariate normal distribution may be questionable.
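As a rough illustration of what recoding a categorical X into dummies can look like, the sketch below uses base R's model.matrix() on the iris data. It is only an assumption about the general idea; the recoding that mixed.mtc applies internally may differ in its details (for instance in the choice of reference category).

```r
# Illustrative sketch only: one common way to turn a factor into dummies.
# Not necessarily the recoding that mixed.mtc() applies internally.
x <- iris[, c("Petal.Length", "Petal.Width", "Species")]

# model.matrix() expands the factor 'Species' into 0/1 indicator columns
# (default treatment contrasts); dropping the first column removes the intercept.
x.num <- model.matrix(~ Petal.Length + Petal.Width + Species, data = x)[, -1]

colnames(x.num)   # two continuous columns plus two Species indicators
head(x.num, 3)
```

The indicator columns take only the values 0 and 1, which is why treating them as jointly normal with the continuous variables is an approximation.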

The whole procedure is based on the imputation method known as predictive mean matching. The procedure consists of three steps:

step 1a) Regression step: the two linear regression models Y vs. X and Z vs. X are considered and their parameters are estimated.

step 1b) Computation of intermediate values. For the units in data.rec the following intermediate values are derived:

$$\tilde{z}_{a} = \hat{\alpha}_{Z} + \hat{\beta}_{Z\bf{X}} \mathbf{x}_a + e_a$$

for each $a=1,\ldots,n_{A}$, where $n_A$ is the number of units (rows) in data.rec. Here $e_a$ is a random draw from the normal distribution with zero mean and estimated residual variance $\hat{\sigma}_{Z|\bf{X}}$.

Similarly, for the units in data.don the following intermediate values are derived:

$$\tilde{y}_{b} = \hat{\alpha}_{Y} + \hat{\beta}_{Y\bf{X}} \mathbf{x}_b + e_b$$

for each $b=1,\ldots,n_{B}$, where $n_B$ is the number of units (rows) in data.don. Here $e_b$ is a random draw from the normal distribution with zero mean and estimated residual variance $\hat{\sigma}_{Y|\bf{X}}$.
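Steps 1a and 1b can be sketched in base R as follows. This is a simplified illustration, not the function's actual code: it assumes ordinary least squares fitted separately on each file (rather than the ML or MS estimators) and uses the residual standard error of each fit as the standard deviation of the random term.

```r
# Simplified sketch of steps 1a and 1b (assumption: separate OLS fits,
# not the ML or MS estimators implemented in mixed.mtc()).
set.seed(98765)
pos  <- sample(1:150, 50, replace = FALSE)
ir.A <- iris[pos,  c(1, 3:5)]   # recipient: Y = Sepal.Length is observed
ir.B <- iris[-pos, 2:5]         # donor:     Z = Sepal.Width is observed

# Step 1a: regress each target on X in the file where the target is observed
fit.y <- lm(Sepal.Length ~ Petal.Length + Petal.Width, data = ir.A)
fit.z <- lm(Sepal.Width  ~ Petal.Length + Petal.Width, data = ir.B)

# Step 1b: intermediate value = predicted value + random normal residual
z.tilde <- predict(fit.z, newdata = ir.A) +
  rnorm(nrow(ir.A), mean = 0, sd = summary(fit.z)$sigma)
y.tilde <- predict(fit.y, newdata = ir.B) +
  rnorm(nrow(ir.B), mean = 0, sd = summary(fit.y)$sigma)

head(cbind(y = ir.A$Sepal.Length, z.tilde = z.tilde))
```

Each recipient row now carries an observed Y and an intermediate Z, and each donor row an intermediate Y and an observed Z, which are the pairs compared in the matching step.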

step 2) Matching step. For each observation (row) in data.rec a donor is chosen in data.don through a nearest neighbor constrained distance hot deck procedure. The distances are computed between $\left( y_a, \tilde{z}_a \right)$ and $\left( \tilde{y}_b, z_b \right)$ using Mahalanobis distance.
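The matching step can be illustrated on toy data. The sketch below uses made-up values for the recipient and donor pairs, estimates the covariance matrix from the pooled points (an assumption), and takes "constrained" to mean that no donor is used more than once. The constrained solution is found here by brute-force enumeration, which is only feasible for tiny examples; the function itself relies on the solvers selected through constr.alg.

```r
# Toy stand-ins (hypothetical values) for (y_a, z~_a) and (y~_b, z_b)
rec <- cbind(y = c(5.1, 6.0, 6.7), z = c(3.4, 2.9, 3.1))
don <- cbind(y = c(5.0, 5.9, 6.5, 7.0), z = c(3.5, 2.8, 3.0, 3.2))

# Covariance matrix for the Mahalanobis metric (pooled points; an assumption)
S <- cov(rbind(rec, don))

# d[a, b] = squared Mahalanobis distance between recipient a and donor b
d <- t(apply(rec, 1, function(p) mahalanobis(don, center = p, cov = S)))

# Unconstrained nearest neighbour: a donor may be picked several times
nearest <- apply(d, 1, which.min)

# Constrained version by brute force: enumerate all assignments in which
# each donor is used at most once and keep the one with minimum total distance
cand  <- expand.grid(rep(list(seq_len(nrow(don))), nrow(rec)))
cand  <- cand[apply(cand, 1, function(r) !anyDuplicated(r)), ]
total <- apply(cand, 1, function(r) sum(d[cbind(seq_len(nrow(rec)), r)]))
best  <- unlist(cand[which.min(total), ])

nearest
best
```

The constrained total distance can never be smaller than the unconstrained one; the constraint trades some closeness for donors that are not reused.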

For further details see Sections 2.5.1 and 3.6.1 in D'Orazio et al. (2006).

In step 1a) the parameters of the regression models can be estimated either by Maximum Likelihood (method="ML"; see D'Orazio et al., 2006, pp. 19--23, 73--75) or by the Moriarity and Scheuren (2001 and 2003) approach (method="MS"; see also D'Orazio et al., 2006, pp. 75--76). The two estimation methods are compared in D'Orazio et al. (2005).

When method="MS", if the value specified for the argument rho.yz is not compatible with the other correlation coefficients estimated from the data, it is replaced with the closest compatible value. When micro=FALSE, only the parameter estimation is performed (step 1a). Otherwise (micro=TRUE), the whole procedure is carried out.
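The notion of a compatible value can be made concrete. For the full correlation matrix of (X, Y, Z) to be positive semidefinite, $\rho_{YZ}$ must lie in an interval determined by the correlations among the X variables and by those of Y and Z with X. The sketch below computes that interval for made-up correlation values and moves an inadmissible guess to the nearest bound. All numbers and the helper name closest.admissible are hypothetical, and the function's internal computation may differ.

```r
# Hypothetical correlations (not estimated from any data set)
R.xx <- matrix(c(1, 0.5,
                 0.5, 1), nrow = 2)   # correlations among X1, X2
r.xy <- c(0.6, 0.4)                   # correlations of Y with X1, X2
r.xz <- c(0.3, 0.5)                   # correlations of Z with X1, X2

R.inv <- solve(R.xx)

# Centre and half-width of the admissible interval for rho_YZ, which follow
# from requiring the partial correlation rho_YZ|X to lie in [-1, 1]
centre <- drop(t(r.xy) %*% R.inv %*% r.xz)
half   <- sqrt(drop(1 - t(r.xy) %*% R.inv %*% r.xy) *
               drop(1 - t(r.xz) %*% R.inv %*% r.xz))
bounds <- c(low = centre - half, up = centre + half)

# Hypothetical helper: move a guess to the nearest admissible value
closest.admissible <- function(guess, bounds) {
  min(max(guess, bounds[["low"]]), bounds[["up"]])
}

bounds
closest.admissible(-0.15, bounds)  # inside the interval, left unchanged
closest.admissible(0.95, bounds)   # outside, replaced by the upper bound
```

A guess of zero partial correlation (the CIA) corresponds to the centre of this interval.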

References

D'Orazio, M., Di Zio, M. and Scanu, M. (2005). A comparison among different estimators of regression parameters on statistically matched files through an extensive simulation study, Contributi, 2005/10, Istituto Nazionale di Statistica, Rome. http://www.istat.it/dati/pubbsci/contributi/Contributi/contr_2005/2005_10.pdf

D'Orazio, M., Di Zio, M. and Scanu, M. (2006). Statistical Matching: Theory and Practice. Wiley, Chichester.

Hornik K. (2012). clue: Cluster ensembles. R package version 0.3-45. http://CRAN.R-project.org/package=clue.

Moriarity, C., and Scheuren, F. (2001). Statistical matching: a paradigm for assessing the uncertainty in the procedure. Journal of Official Statistics, 17, 407--422. http://www.jos.nu/Articles/abstract.asp?article=173407

Moriarity, C., and Scheuren, F. (2003). A note on Rubin's statistical matching using file concatenation with adjusted weights and multiple imputation, Journal of Business and Economic Statistics, 21, 65--73.

See Also

NND.hotdeck, mahalanobis.dist

Examples

library(StatMatch)

# reproduce the statistical matching framework
# starting from the iris data.frame
set.seed(98765)
pos <- sample(1:150, 50, replace=FALSE)
ir.A <- iris[pos,c(1,3:5)]
ir.B <- iris[-pos, 2:5]

xx <- intersect(colnames(ir.A), colnames(ir.B))
xx  # common variables

# ML estimation method under CIA (rho_YZ|X = 0);
# only parameter estimates (micro=FALSE)
# only continuous matching variables
xx.mtc <- c("Petal.Length", "Petal.Width")
mtc.1 <- mixed.mtc(data.rec=ir.A, data.don=ir.B, match.vars=xx.mtc,
                    y.rec="Sepal.Length", z.don="Sepal.Width")

# estimated correlation matrix
mtc.1$cor 

# ML estimation method under CIA (rho_YZ|X = 0);
# only parameter estimates (micro=FALSE)
# categorical variable 'Species' used as matching variable

xx.mtc <- xx
mtc.2 <- mixed.mtc(data.rec=ir.A, data.don=ir.B, match.vars=xx.mtc,
                    y.rec="Sepal.Length", z.don="Sepal.Width")

# estimated correlation matrix
mtc.2$cor 


# ML estimation method with partial correlation coefficient
# set equal to 0.5 (rho_YZ|X=0.5)
# only parameter estimates (micro=FALSE)

mtc.3 <- mixed.mtc(data.rec=ir.A, data.don=ir.B, match.vars=xx.mtc,
                    y.rec="Sepal.Length", z.don="Sepal.Width",
                    rho.yz=0.5)

# estimated correlation matrix
mtc.3$cor 

# ML estimation method with partial correlation coefficient
# set equal to 0.5 (rho_YZ|X=0.5)
# with imputation step (micro=TRUE)

mtc.4 <- mixed.mtc(data.rec=ir.A, data.don=ir.B, match.vars=xx.mtc,
                    y.rec="Sepal.Length", z.don="Sepal.Width",
                    rho.yz=0.5, micro=TRUE, constr.alg="Hungarian")

# first rows of data.rec filled in with z
head(mtc.4$filled.rec)

#
# Moriarity and Scheuren estimation method under CIA;
# only with parameter estimates (micro=FALSE)
mtc.5 <- mixed.mtc(data.rec=ir.A, data.don=ir.B, match.vars=xx.mtc,
                    y.rec="Sepal.Length", z.don="Sepal.Width",
                    method="MS")

# the starting value of rho.yz and the value used
# in computations
mtc.5$rho.yz

# estimated correlation matrix
mtc.5$cor 

# Moriarity and Scheuren estimation method
# with correlation coefficient set equal to -0.15 (rho_YZ=-0.15)
# with imputation step (micro=TRUE)

mtc.6 <- mixed.mtc(data.rec=ir.A, data.don=ir.B, match.vars=xx.mtc,
                    y.rec="Sepal.Length", z.don="Sepal.Width",
                    method="MS", rho.yz=-0.15, 
                    micro=TRUE, constr.alg="lpSolve")

# the starting value of rho.yz and the value used
# in computations
mtc.6$rho.yz

# estimated correlation matrix
mtc.6$cor

# first rows of data.rec filled in with z imputed values
head(mtc.6$filled.rec)
