felm: Fit a linear model with multiple group fixed effects

Description

'felm' fits linear models with multiple group fixed effects, similarly to lm. It uses the method of alternating projections to sweep out multiple group effects from the normal equations before estimating the remaining coefficients with OLS.
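The idea of sweeping out group effects can be illustrated in base R. The following is a minimal sketch under simplifying assumptions (two factors, one covariate, simulated data), not the code lfe actually runs; the helper name sweep_out is hypothetical.

```r
# Sketch only: base-R illustration of alternating projections, not lfe's code.
set.seed(1)
n    <- 500
id   <- factor(sample(10, n, replace = TRUE))
firm <- factor(sample(6, n, replace = TRUE))
x    <- rnorm(n)
y    <- 2 * x + rnorm(nlevels(id))[id] + rnorm(nlevels(firm))[firm] + rnorm(n)

# Hypothetical helper: subtract the group means of each factor in turn,
# and repeat until the vector stops changing.
sweep_out <- function(v, factors, tol = 1e-10, maxit = 1000) {
  for (i in seq_len(maxit)) {
    old <- v
    for (f in factors) v <- v - ave(v, f)
    if (max(abs(v - old)) < tol) break
  }
  v
}

fl  <- list(id, firm)
y_c <- sweep_out(y, fl)
x_c <- sweep_out(x, fl)

# OLS on the centred data; no intercept, since it has been swept out as well.
b_sweep <- unname(coef(lm(y_c ~ x_c - 1)))

# Reference: the slope from an explicit dummy-variable regression.
b_dummy <- unname(coef(lm(y ~ x + id + firm))["x"])

print(c(sweep = b_sweep, dummy = b_dummy))
```

Both slopes should agree up to the convergence tolerance. The dummy-variable route becomes impractical when the factors have many levels, which is the case felm is aimed at.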

This function is intended for use with large datasets with multiple group effects of large cardinality. If dummy-encoding the group effects results in a manageable number of coefficients, you are probably better off using lm.

Usage

felm(formula, data, iv=NULL, clustervar=NULL, exactDOF=FALSE,
subset, na.action, contrasts=NULL, ...)

Arguments

formula

an object of class '"formula"' (or one that can be coerced to that class): a symbolic description of the model to be fitted, similar to 'lm'. See Details.

data

a data frame containing the variables of the model.

iv

a formula describing an instrumented variable. Estimated via two step OLS. Deprecated, replaced by multi part formula specification.

clustervar

a string or factor. Either the name of a variable or a factor, or a list thereof. Used for computing clustered standard errors. Deprecated, replaced by multi part formula specification.

exactDOF

logical. If more than two factors, the degrees of freedom used to scale the covariance matrix (and the standard errors) is normally estimated. Setting exactDOF=TRUE causes felm to attempt to compute it, bu

subset

an optional vector specifying a subset of observations to be used in the fitting process.

na.action

a function which indicates what should happen when the data contain NAs. The default is set by the na.action setting of options, and is na.fail if that is unset. The 'factory-fr

contrasts

an optional list. See the contrasts.arg of model.matrix.default.

...

other arguments. The clustervar and iv arguments will be removed from the argument list at a later time, but will continue to be supported in this field. Currently, the only argument supported in this field is the 'logi

Value

felm returns an object of class "felm". It is quite similar to an "lm" object, but not entirely compatible.
The generic summary method will yield a summary which may be printed. The object has some resemblance to an lm object, and some postprocessing methods designed for lm may happen to work. It may however be necessary to coerce the object for this to succeed.
The "felm" object is a list containing the following fields:

coefficients

a numerical vector. The estimated coefficients.

N

an integer. The number of observations.

p

an integer. The total number of coefficients, including those projected out.

response

a numerical vector. The response vector.

fitted.values

a numerical vector. The fitted values.

residuals

a numerical vector. The residuals of the full system, with dummies.

r.residuals

a numerical vector. Reduced residuals, i.e. the residuals resulting from predicting without the dummies.

cfactor

factor of length N. The factor describing the connected components of the two first terms in the second part of the model formula.

vcv

a matrix. The variance-covariance matrix.

fe

list of factors. A list of the terms in the second part of the model formula.

step1

list of 'felm' objects for the IV 1. step(s), if used.

iv1fstat

numerical vector. For IV 1. steps, F-value for excluded instruments, the number of parameters in restricted model and in the unrestricted model.

X

matrix. The expanded data matrix, i.e. from the first part of the formula. To save memory with large datasets, it is only included if felm(keepX=TRUE) is specified. Must be included if bccorr is to be used for correcting limited mobility bias.

Details

The formula specification is a response variable followed by a four part formula. The first part consists of ordinary covariates, the second part consists of factors to be projected out. The third part is an IV-specification. The fourth part is a cluster specification for the standard errors. I.e. something like

y ~ x1 + x2 | f1 + f2 |
  (Q|W ~ x3+x4) | clu1 + clu2

where y is the response, x1 and x2 are ordinary covariates, f1 and f2 are factors to be projected out, Q and W are covariates which are instrumented by x3 and x4, and clu1 and clu2 are factors to be used for computing cluster robust standard errors. Parts that are not used should be specified as 0, except at the end of the formula, where they can be omitted. The parentheses are needed in the third part since | has higher precedence than ~.
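The precedence remark can be checked with base R alone by looking at how the parser groups the two variants. This sketch only inspects the parse tree; it does not call felm, and the variable names are the placeholders from the formula above.

```r
# Base R only: inspect how the parser groups '|' and '~'.
with_parens    <- y ~ x1 + x2 | f1 + f2 | (Q | W ~ x3 + x4) | clu1 + clu2
without_parens <- y ~ x1 + x2 | f1 + f2 | Q | W ~ x3 + x4 | clu1 + clu2

# With parentheses, the left-hand side of the outer '~' is just the response.
lhs_ok <- with_parens[[2]]

# Without them, '~' binds last, so everything before the second '~'
# ends up on the left-hand side, which is itself a '~' call.
lhs_bad <- without_parens[[2]]

print(lhs_ok)
print(lhs_bad)
```

In the second variant the instruments and the cluster factors land together on the right-hand side of the outer formula, so the four-part structure is lost.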

Interactions between a covariate x and a factor f can be projected out with the syntax x:f. The terms in the second and fourth parts are not treated as ordinary formulas. In particular, something like y ~ x1 | x*f is not possible; one would instead specify y ~ x1 + x | x:f + f. Note that f:x also works, since R's parser does not keep the order. Because the order is not kept, felm cannot tell from position which term is the factor, so in an interaction the factor must actually be a factor, whereas a non-interacted factor will be coerced to a factor. I.e. in y ~ x1 | x:f1 + f2, the f1 must be a factor, whereas it will work as expected if f2 is an integer vector.

In older versions of lfe the syntax was felm(y ~ x1 + x2 + G(f1) + G(f2), iv=list(Q ~ x3+x4, W ~ x3+x4), clustervar=c('clu1','clu2')). This syntax still works.

The standard errors are adjusted for the reduced degrees of freedom coming from the dummies which are implicitly present. In the case of two factors, the exact number of implicit dummies is easy to compute. If there are more factors, the number of dummies is estimated by assuming there is one reference level for each factor. This may be a slight over-estimation, leading to slightly too large standard errors. Setting exactDOF='rM' computes the exact degrees of freedom with rankMatrix() in package Matrix. Note that version 1.1-0 of Matrix has a bug in rankMatrix() for sparse matrices which may cause it to return the wrong value. A fix is underway.
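The gap between a reference-level count and the exact rank can be seen on a tiny data set. The sketch below uses base R, with qr() standing in for rankMatrix(), and the naive count reflects one reading of the estimate described above (one reference level dropped for every factor after the first); it is an illustration, not lfe's internal computation.

```r
# Base R sketch; qr() is used here in place of Matrix::rankMatrix().
# f1 and f2 are fully crossed; f3 is nested in f1, so its dummies are
# linear combinations of the f1 dummies.
d <- expand.grid(f1 = c("a", "b", "c", "d"), f2 = c("p", "q"))
d$f3 <- factor(ifelse(d$f1 %in% c("a", "b"), "u", "v"))

# All dummy columns for the three factors, side by side.
dummies <- cbind(model.matrix(~ f1 - 1, d),
                 model.matrix(~ f2 - 1, d),
                 model.matrix(~ f3 - 1, d))

# Exact number of independent dummies.
exact_rank <- qr(dummies)$rank

# Naive count: all levels, minus one reference level per factor after the first.
naive_count <- sum(sapply(d, nlevels)) - (ncol(d) - 1)

print(c(exact = exact_rank, naive = naive_count))
```

The naive count can never fall below the exact rank, so when the two differ the estimated degrees of freedom are too small and the standard errors come out slightly too large, as described above.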

For the iv-part of the formula, it is only necessary to include the instruments on the right hand side. The other explanatory covariates, from the first and second part of formula, are added automatically in the first stage regressions. See the examples.
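What it means for the other covariates to be added to the first stage can be sketched by hand with lm. This is a simplified illustration on simulated data with no fixed effects; the variable names (z for the instrument, e for an unobserved confounder) are hypothetical, and the standard errors reported by the second lm call would not be valid IV standard errors.

```r
# Hand-rolled two-stage sketch with lm; not how felm is implemented.
set.seed(42)
n <- 5000
x <- rnorm(n)                                # ordinary covariate
z <- rnorm(n)                                # instrument
e <- rnorm(n)                                # unobserved confounder
Q <- 0.5 * x + z + e + rnorm(n, sd = 0.5)    # endogenous regressor
y <- x + 2 * Q + e + rnorm(n)                # true coefficient on Q is 2

# First stage: the instrument plus the other explanatory covariate x.
Qhat <- fitted(lm(Q ~ z + x))

# Second stage: replace Q by its first-stage fitted values.
b_2sls <- unname(coef(lm(y ~ x + Qhat))["Qhat"])

# For comparison: plain OLS, which is biased by the confounder.
b_ols <- unname(coef(lm(y ~ x + Q))["Q"])

print(c(two_stage = b_2sls, ols = b_ols))
```

With felm the covariate x would not have to be repeated in the IV part; only z would appear on the right-hand side there.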

The contrasts argument is similar to the one in lm(), it is used for factors in the first part of the formula. The factors in the second part are analyzed as part of a possible subsequent getfe() call.

The old syntax with a single part formula with the G() syntax for the factors to transform away is still supported, as well as the clustervar and iv arguments, but users are encouraged to move to the new multi part formulas as described here. In an upcoming version of lfe, the clustervar and iv arguments will be moved to the ... argument list. In the event that you use these arguments, and rewriting to the new syntax is impractical, you should make sure to name them (i.e. not use them as positional arguments). felm will issue a warning if these two arguments are not named.

Note that the way missing values (NAs) in IV estimations are handled in lfe currently may lead to problems. Missing values are removed independently in the first and second stages. Thus, if the instruments have missing values where the other covariates have not, more observations are removed in the first stage than in the second, leading to problems, confusion and general havoc.
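One defensive approach, which is a suggestion rather than something lfe prescribes, is to drop incomplete rows over all variables of both stages before fitting, so that the first and second stages see the same observations. A minimal base-R sketch with a made-up data frame:

```r
# Made-up data: x is missing in row 3, the instrument x3 in row 2.
d <- data.frame(
  y  = c(1.2, 0.7, 3.1, 2.2, 1.9),
  x  = c(0.1, 0.4, NA, 0.3, 0.8),
  Q  = c(1.0, 0.2, 0.5, 0.9, 0.4),
  x3 = c(0.3, NA, 0.6, 0.1, 0.7)
)

# Every variable used anywhere in the model, including the instruments.
vars <- c("y", "x", "Q", "x3")

# Keep only rows that are complete across all of them.
d_complete <- d[complete.cases(d[, vars]), ]

print(nrow(d_complete))
```

The filtered data frame can then be passed as the data argument, so that NA removal inside the two stages has nothing left to do.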

An alternative to clustered standard errors is to project out the cluster factors (put them in the second part of the formula) and use heteroskedastic standard errors.

Note that the F-test which is computed by summary.felm is unreliable for robust standard errors.

References

Cameron, A.C., J.B. Gelbach and D.L. Miller (2011) Robust inference with multiway clustering, Journal of Business & Economic Statistics 29, no. 2, 238-249. http://dx.doi.org/10.1198/jbes.2010.07136

Examples

library(lfe)
oldopts <- options(lfe.threads=1)
## create covariates
x <- rnorm(1000)
x2 <- rnorm(length(x))

## individual and firm
id <- factor(sample(20,length(x),replace=TRUE))
firm <- factor(sample(13,length(x),replace=TRUE))

## effects for them
id.eff <- rnorm(nlevels(id))
firm.eff <- rnorm(nlevels(firm))

## left hand side
u <- rnorm(length(x))
y <- x + 0.5*x2 + id.eff[id] + firm.eff[firm] + u

## estimate and print result
est <- felm(y ~ x+x2| id + firm)
summary(est)
## compare with lm
summary(lm(y ~ x + x2 + id + firm-1))


# make an example with 'reverse causation'
# Q and W are instrumented by x3 and the factor x4. Report robust s.e.
x3 <- rnorm(length(x))
x4 <- sample(12,length(x),replace=TRUE)

Q <- 0.3*x3 + x + 0.2*x2 + id.eff[id] + 0.3*log(x4) - 0.3*y + rnorm(length(x),sd=0.3)
W <- 0.7*x3 - 2*x + 0.1*x2 - 0.7*id.eff[id] + 0.8*cos(x4) - 0.2*y+ rnorm(length(x),sd=0.6)

# add them to the outcome
y <- y + Q + W

ivest <- felm(y ~ x + x2 | id+firm | (Q|W ~ x3+factor(x4)))
summary(ivest,robust=TRUE)
# compare with the not instrumented fit:
summary(felm(y ~ x + x2 +Q + W |id+firm))
options(oldopts)
