Learn R Programming

REndo (version 2.4.10)

copulaCorrection: Fitting Linear Models Endogenous Regressors using Gaussian Copula

Description

Fits linear models with continuous or discrete endogenous regressors (or a mixture of both) using Gaussian copulas, as presented in Park and Gupta (2012). This is a statistical technique to address the endogeneity problem where no external instrumental variables are needed. The important assumption of the model is that the endogenous variables should NOT be normally distributed, if continuous, preferably with a skewed distribution. The corrections proposed by Qian, Koschmann, and Xie (2024, p.19-22) are implemented. These mitigate the bias of the original paper for small and moderate sample sizes.

Usage

copulaCorrection(formula, data, num.boots = 1000, verbose = TRUE, ...)

Value

For all cases, an object of classes rendo.copula.correction, rendo.boots, and rendo.base is returned which is a list and contains the following components:

formula

The formula given to specify the fitted model.

terms

The terms object used for model fitting.

model

The model.frame used for model fitting.

coefficients

A named vector of all coefficients resulting from model fitting.

names.main.coefs

a vector specifying which coefficients are from the model. For internal usage.

names.vars.continuous

The names of the continuous endogenous regressors.

names.vars.discrete

The names of the discrete endogenous regressors.

fitted.values

Fitted values at the found solution.

residuals

The residuals at the found solution.

boots.params

The bootstrapped coefficients.

For the case of a single continuous endogenous regressor, the returned object further contains the following components:

start.params

A named vector with the initial set of parameters used to optimize the log-likelihood function.

res.optimx

The result object returned by the function optimx after optimizing the log-likelihood function.

For all other cases, the returned object further contains the following component:

res.lm.real.data

The linear model fitted on the original data together with generated p.star data.

The function summary can be used to obtain and print a summary of the results. Depending on the returned object, the generic accessor functions coefficients, fitted.values, residuals, vcov, logLik, AIC, BIC, and nobs are available.

Arguments

formula

A symbolic description of the model to be fitted. See the "Details" section for the exact notation.

data

A data.frame containing the data of all parts specified in the formula parameter.

num.boots

Number of bootstrapping iterations. Defaults to 1000.

verbose

Show details about the running of the function.

...

Arguments for the log-likelihood optimization function in the case of a single continuous endogenous regressor. Ignored with a warning otherwise.

start.params

A named vector containing a set of parameters to use in the first optimization iteration. The names have to correspond exactly to the names of the components specified in the formula parameter. If not provided, a linear model is fitted to derive them.

optimx.args

A named list of arguments which are passed to optimx. This allows users to tweak optimization settings to their liking.

Details

Method

The underlying idea of the joint estimation method is that using information contained in the observed data, one selects marginal distributions for the endogenous regressor and the structural error term, respectively. Then, the copula model enables the construction of a flexible multivariate joint distribution allowing a wide range of correlations between the two marginals.

Consider the model:


Yt01Pt2Xtt

where \(t=1,..,T\) indexes either time or cross-sectional units, Yt is a \(1x1\) response variable, Xt is a \(kxn\) exogenous regressor, Pt is a \(kx1\) continuous endogenous regressor, εt is a normally distributed structural error term with mean zero and E(ε2)=σε2, \(\alpha\) and \(\beta\) are model parameters.

The marginal distribution of the endogenous regressor Pt is obtained using the Epanechnikov kernel density estimator (Epanechnikov, 1969), as below:


ĥ(p)=1/(T·b) ∑(K·((p-Pt)/b))

where Pt is the endogenous regressor, K(x)=0.75·(1-x2)·I(|x|<=1) and="" the="" bandwidth="" \(b\)="" is="" one="" proposed="" by="" silverman="" (1986),="" equal="" to="" b="0.9·T-1.5·min(s, IQR/1.34). \(IQR\) is the interquartile range while \(s\) is the data sample standard deviation and \(T\) is the number of time periods observed in the data. After obtaining the joint distribution of the error term and the continuous endogenous regressor, the model parameters are estimated using maximum likelihood estimation.

The additional parameters used during model fitting and printed in summary hence are:

rho

The correlation between the endogenous regressor and the error.

sigma

The variance of the model's error.

With more than one continuous endogenous regressor or an endogenous discrete regressor, an alternative approach to the estimation using Gaussian copula should be applied. This approach is similar to the control function approach (Petrin and Train, 2010). The core idea is to apply OLS estimation on the original set of explanatory variables in the model equation above, plus an additional regressor Pt*=Φ-1(H(Pt)). Here, H(Pt) is the marginal distribution of the endogenous regressor \(P\). Including this regressor solves the correlation between the endogenous regressor and the structural error, \(\epsilon\), OLS providing consistent parameter estimates. Due to identification problems, the discrete endogenous regressor cannot have a binomial distribution.

Hence, only in the case of a single continuous endogenous regressor maximum likelihood estimation is used. In all other cases, augmented OLS based on Gaussian copula is applied. This includes cases of multiple endogenous regressors of both discrete and continuous distributions.

In the case of discrete endogenous regressors, a random seed needs to be assigned because the marginal distribution function of the endogenous regressor is a step function in this case. This means that the value of P* lies between 2 values, Φ-1(H(Pt-1)) and Φ-1(H(Pt)). However, the reported upper and lower bounds of the 95% bootstrapped confidence interval gives indication of the variance of the estimates.

Since the inference procedure in both cases, augmented OLS and maximum likelihood, occurs in two stages (first the empirical distribution of the endogenous regressor is computed and then used in constructing the likelihood function), the standard errors are not correct. Therefore, in both cases, the standard errors and the confidence intervals are obtained based on the sampling distributions resulted from bootstrapping. Since the distribution of the bootstrapped parameters is highly skewed, we report the percentile confidence intervals. Moreover, the variance-covariance matrix is also computed based on the bootstrapped parameters, and not based on the Hessian.

Formula parameter

The formula argument follows a two part notation:

A two-sided formula describing the model (e.g. y ~ X1 + X2 + P) to be estimated and a second right-hand side part in which the endogenous regressors and their distributional assumptions are indicated (e.g. continuous(P)). These two parts are separated by a single vertical bar (|). In the second part, the special functions continuous, discrete, or a combination of both, are used to indicate the endogenous regressors and their respective distribution. Both functions use the ... parameter in which the respective endogenous regressors is specified.

Note that no argument to continuous or discrete is to be supplied as character but as symbols without quotation marks.

See the example section for illustrations on how to specify the formula parameter.

References

Park, S. and Gupta, S., (2012), "Handling Endogenous Regressors by Joint Estimation Using Copulas", Marketing Science, 31(4), 567-86.

Qian, Y., Koschmann, A., and Xie, H. (2024). "A Practical Guide to Endogeneity Correction Using Copulas". National Bureau of Economic Research, w32231.

Epanechnikov V (1969). "Nonparametric Estimation of a Multidimensional Probability Density." Teoriya veroyatnostei i ee primeneniya, 14(1), 156–161.

Silverman B (1986). "Density Estimation for Statistics and Data Analysis". CRC Monographs on Statistics and Applied Probability. London: Chapman & Hall.

Petrin A, Train K (2010). "A Control Function Approach to Endogeneity in Consumer Choice Models." Journal of Marketing Research, 47(1), 3–13.

See Also

summary for how fitted models are summarized

vcov for how the variance-covariance matrix is derived

confint for how confidence intervals are derived

optimx for possible elements of parameter optimx.arg

Examples

Run this code
data("dataCopCont")
data("dataCopCont2")
data("dataCopDis")
data("dataCopDis2")
data("dataCopDisCont")

# \donttest{
if (FALSE) {
# Single continuous: log-likelihood optimization
c1 <- copulaCorrection(y~X1+X2+P|continuous(P), num.boots=10, data=dataCopCont)
# same as above, with start.parameters and number of bootstrappings
c1 <- copulaCorrection(y~X1+X2+P|continuous(P), num.boots=10, data=dataCopCont,
                       start.params = c("(Intercept)"=1, X1=1, X2=-2, P=-1))

# All following examples fit linear model with Gaussian copulas

# 2 continuous endogenous regressors
c2 <- copulaCorrection(y~X1+X2+P1+P2|continuous(P1, P2),
                        num.boots=10, data=dataCopCont2)
# same as above
c2 <- copulaCorrection(y~X1+X2+P1+P2|continuous(P1)+continuous(P2),
                        num.boots=10, data=dataCopCont2)

# single discrete endogenous regressor
d1 <- copulaCorrection(y~X1+X2+P|discrete(P), num.boots=10, data=dataCopDis)

# two discrete endogenous regressor
d2 <- copulaCorrection(y~X1+X2+P1+P2|discrete(P1)+discrete(P2),
                        num.boots=10, data=dataCopDis2)
# same as above but less bootstrap runs
d2 <- copulaCorrection(y~X1+X2+P1+P2|discrete(P1, P2), num.boots = 10,
                       data=dataCopDis2)

# single discrete, single continuous
cd <- copulaCorrection(y~X1+X2+P1+P2|discrete(P1)+continuous(P2),
                        num.boots=10, data=dataCopDisCont)

# For single continuous only: use own optimization settings (see optimx())
# set maximum number of iterations to 50'000
res.c1 <- copulaCorrection(y~X1+X2+P|continuous(P),
                           optimx.args = list(itnmax = 50000),
                            num.boots=10, data=dataCopCont)

# print detailed tracing information on progress
 res.c1 <- copulaCorrection(y~X1+X2+P|continuous(P),
                            optimx.args = list(control = list(trace = 6)),
                             num.boots=10, data=dataCopCont)

# use method L-BFGS-B instead of Nelder-Mead and print report every 50 iterations
 res.c1 <- copulaCorrection(y~X1+X2+P|continuous(P),
                            optimx.args = list(method = "L-BFGS-B",
                                               control=list(trace = 2, REPORT=50)),
                             num.boots=10, data=dataCopCont)

# For coef(), the parameter "complete" determines if only the
# main model parameters or also the auxiliary coefficients are returned

c1.all.coefs <- coef(res.c1) # also returns rho and sigma
# same as above
c1.all.coefs <- coef(res.c1, complete = TRUE)

# only main model coefs
c1.main.coefs <- coef(res.c1, complete = FALSE)

}# }

Run the code above in your browser using DataLab