latentIV: Fitting Linear Models with one Endogenous Regressor using Latent Instrumental Variables

Description

Fits linear models with one endogenous regressor and no additional explanatory variables using the latent instrumental variable approach presented in Ebbes, P., Wedel, M., Böckenholt, U., and Steerneman, A. G. M. (2005). This statistical technique addresses the endogeneity problem without requiring external instrumental variables. The key assumptions of the model are that the latent variable is discrete with at least two groups with different means and that the structural error is normally distributed.

Usage

latentIV(
  formula,
  data,
  start.params = c(),
  optimx.args = list(),
  verbose = TRUE
)

Value

An object of classes rendo.latent.IV and rendo.base is returned which is a list and contains the following components:

formula: The formula given to specify the fitted model.
terms: The terms object used for model fitting.
model: The model.frame used for model fitting.
coefficients: A named vector of all coefficients resulting from model fitting.
names.main.coefs: A vector specifying which coefficients are from the model. For internal usage.
start.params: A named vector with the initial set of parameters used to optimize the log-likelihood function.
res.optimx: The result object returned by the function optimx after optimizing the log-likelihood function.
hessian: A named, symmetric matrix giving an estimate of the Hessian at the found solution.
m.delta.diag: A diagonal matrix needed when deriving the vcov to apply the delta method on theta5 which was transformed during the LL optimization.
fitted.values: Fitted values at the found optimal solution.
residuals: The residuals at the found optimal solution.

The function summary can be used to obtain and print a summary of the results. The generic accessor functions coefficients, fitted.values, residuals, vcov, confint, logLik, AIC, BIC, case.names, and nobs are available.

Arguments

formula: A symbolic description of the model to be fitted. Of class "formula".
data: A data.frame containing the data of all parts specified in the formula parameter.
start.params: A named vector containing a set of parameters to use in the first optimization iteration. The names have to correspond exactly to the names of the components specified in the formula parameter. If not provided, a linear model is fitted to derive them.
optimx.args: A named list of arguments which are passed to optimx. This allows users to tweak optimization settings to their liking.
verbose: Show details about the running of the function.
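
As an illustration of the naming requirement for start.params, the sketch below fits an ordinary linear model to a toy data set and inspects the names of its coefficients. The data frame toy and its values are assumptions made up for this example, and the sketch only shows the shape of a suitable named vector; it does not claim to reproduce how latentIV derives its defaults internally.

```r
# Toy data (hypothetical, for illustration only)
set.seed(42)
toy <- data.frame(P = rnorm(100))
toy$y <- 1 + 2 * toy$P + rnorm(100)

# Coefficients of a plain linear model give a named vector whose names
# match the components of the formula y ~ P
start <- coef(lm(y ~ P, data = toy))
names(start)
```

A vector with these names, for example c("(Intercept)" = 2.5, P = -0.5), has the form expected by start.params for the formula y ~ P.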

Details

Let's consider the model:

Y_t=β₀+αP_t+ε_t

P_t=π'Z_t+ν_t

where \(t = 1, \ldots, T\) indexes either time or cross-sectional units, Y_t is the dependent variable, P_t is a k x 1 continuous, endogenous regressor, ε_t is a structural error term with mean zero and E(ε²)=σ_ε², and \(\alpha\) and β₀ are model parameters. Z_t is an l x 1 vector of instruments, and ν_t is a random error with mean zero and E(ν²)=σ_ν². The endogeneity problem arises from the correlation of P_t and ε_t through E(εν)=σ_εν.

latentIV considers Z_t' to be a latent, discrete, exogenous variable with an unknown number of groups \(m\), where \(\pi\) is a vector of group means. It is assumed that \(Z\) is independent of the error terms \(\epsilon\) and \(\nu\) and that it has at least two groups with different means. The structural and random errors are assumed to be normally distributed with mean zero and variance-covariance matrix \(\Sigma\):

Σ = ( σ_ε²   σ_εν
      σ_εν   σ_ν² )

The identification of the model lies in the assumption of the non-normality of P_t, the discreteness of the unobserved instruments and the existence of at least two groups with different means.
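
The setup described above can be made concrete with a small simulation. The sketch below draws data from the two equations with a two-group latent instrument and correlated normal errors, then shows that ordinary least squares does not recover the slope. All numeric values (sample size, β₀ = 2, α = -1, group means 0 and 3, σ_εν = 0.5) are arbitrary assumptions chosen for illustration, and only base R is used.

```r
set.seed(123)
n <- 5000

# Assumed "true" parameters (illustrative values only)
beta0 <- 2
alpha <- -1
pi_means <- c(0, 3)                      # group means of the latent instrument

# Latent, discrete instrument: group membership 1 or 2
group <- rbinom(n, size = 1, prob = 0.5) + 1

# Correlated normal errors (eps, nu) with covariance sigma_eps_nu = 0.5
Sigma <- matrix(c(1, 0.5, 0.5, 1), nrow = 2)
errors <- matrix(rnorm(2 * n), ncol = 2) %*% chol(Sigma)
eps <- errors[, 1]
nu <- errors[, 2]

# Endogenous regressor and outcome
P <- pi_means[group] + nu
y <- beta0 + alpha * P + eps
sim <- data.frame(y = y, P = P)

# OLS ignores the correlation between P and eps, so its slope is biased
ols_slope <- unname(coef(lm(y ~ P, data = sim))["P"])
ols_slope
```

Because P_t and ε_t are positively correlated in this setup, the OLS slope lies above the assumed α. A data frame such as sim could then be passed to latentIV(y ~ P, data = sim), following the Usage section.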

The method has been implemented such that the latent variable has two groups. Ebbes et al. (2005) show in a Monte Carlo experiment that even if the true number of categories of the instrument is larger than two, estimates are approximately consistent. Besides, overfitting in terms of the number of groups/categories reduces the degrees of freedom and leads to efficiency loss. For a model with additional explanatory variables a Bayesian approach is needed, since identification issues appear in a frequentist approach.

Identification of the parameters relies on the distributional assumptions of the latent instruments as well as that of the endogenous regressor P_t. Specifically, the endogenous regressor should have a non-normal distribution while the unobserved instruments, \(Z\), should be discrete and have at least two groups with different means (Ebbes, Wedel, and Böckenholt, 2009). A continuous distribution for the instruments leads to an unidentified model, while a normal distribution of the endogenous regressor gives rise to inefficient estimates.

Additional parameters used during model fitting and printed in summary are:

pi1: The instrumental variables \(Z\) are assumed to be divided into two groups. pi1 represents the estimated group mean of the first group.
pi2: The estimated group mean of the second group of the instrumental variables \(Z\).
theta5: The probability of being in the first group of the instruments.
theta6: The variance, σ_ε².
theta7: The covariance, σ_εν.
theta8: The variance, σ_ν².
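
To see how these parameters fit together, the model equations can be combined into the joint distribution of (Y_t, P_t). The following is a sketch derived from the two-group setup and the normality assumption stated in this section, not a transcription of the package's internal code. Here \(\pi_1\) and \(\pi_2\) stand for pi1 and pi2, \(\lambda\) for the group probability theta5, and \(\phi_2\) for the bivariate normal density; \(\lambda\), \(\mu_j\), \(\Omega\), and \(\phi_2\) are notation introduced only for this illustration.

```latex
\begin{aligned}
Y_t &= \beta_0 + \alpha \pi_j + (\alpha \nu_t + \epsilon_t), \qquad
P_t = \pi_j + \nu_t \qquad \text{for group } j \in \{1, 2\} \\[4pt]
\mu_j &= \begin{pmatrix} \beta_0 + \alpha \pi_j \\ \pi_j \end{pmatrix}, \qquad
\Omega = \begin{pmatrix}
  \alpha^2 \sigma_\nu^2 + 2 \alpha \sigma_{\epsilon\nu} + \sigma_\epsilon^2 &
  \alpha \sigma_\nu^2 + \sigma_{\epsilon\nu} \\
  \alpha \sigma_\nu^2 + \sigma_{\epsilon\nu} &
  \sigma_\nu^2
\end{pmatrix} \\[4pt]
f(y_t, p_t) &= \lambda \, \phi_2\big((y_t, p_t)'; \mu_1, \Omega\big)
  + (1 - \lambda) \, \phi_2\big((y_t, p_t)'; \mu_2, \Omega\big)
\end{aligned}
```

Read this way, the observed data follow a two-component bivariate normal mixture whose components share one covariance matrix and differ only in their means, which is why the two group means must differ for the components to be distinguishable.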

References

Ebbes, P., Wedel, M., Böckenholt, U., and Steerneman, A. G. M. (2005). 'Solving and Testing for Regressor-Error (in)Dependence When no Instrumental Variables are Available: With New Evidence for the Effect of Education on Income'. Quantitative Marketing and Economics, 3:365--392.

Ebbes, P., Wedel, M., and Böckenholt, U. (2009). 'Frugal IV Alternatives to Identify the Parameter for an Endogenous Regressor'. Journal of Applied Econometrics, 24(3):446--468.

Examples


# \donttest{
data("dataLatentIV")

# function call without any initial parameter values
l  <- latentIV(y ~ P, data = dataLatentIV)
summary(l)

# function call with initial parameter values given by the user
l1 <- latentIV(y ~ P, start.params = c("(Intercept)"=2.5, P=-0.5),
               data = dataLatentIV)
summary(l1)

# use own optimization settings (see optimx())
# set maximum number of iterations to 50'000
l2 <- latentIV(y ~ P, optimx.args = list(itnmax = 50000),
               data = dataLatentIV)

# print detailed tracing information on progress
l3 <- latentIV(y ~ P, optimx.args = list(control = list(trace = 6)),
               data = dataLatentIV)

# use method L-BFGS-B instead of Nelder-Mead and print report every 50 iterations
l4 <- latentIV(y ~ P,
               optimx.args = list(method = "L-BFGS-B",
                                  control = list(trace = 2, REPORT = 50)),
               data = dataLatentIV)

# read out all coefficients, incl auxiliary coefs
lat.all.coefs <- coef(l4)
# same as above
lat.all.coefs <- coef(l4, complete = TRUE)
# only main model coefs
lat.main.coefs <- coef(l4, complete = FALSE)
# }
