demeanlist: Centre vectors on multiple groups

Description

Uses the method of alternating projections to centre a (model) matrix on multiple groups, as specified by a list of factors. This function is called by felm, but it has been made available as standalone in case it's needed. In particular, if one does not need transformations provided by R-formulas but have the covariates present as a matrix or a data.frame, a substantial amount of time can be saved in the centering.

Usage

demeanlist(mtx, fl, icpt = 0L, eps = getOption("lfe.eps"),
  threads = getOption("lfe.threads"), progress = getOption("lfe.pint"),
  accel = getOption("lfe.accel"), randfact = TRUE, means = FALSE,
  weights = NULL, scale = TRUE, na.rm = FALSE, attrs = NULL)

Arguments

mtx

matrix whose columns form vectors to be group-centred. mtx can also be a list of vectors or matrices, such as a data frame.

list of factors defining the grouping structure.

icpt

the position of the intercept, this column is removed from the result matrix.

eps

a tolerance for the centering.

threads

an integer specifying the number of threads to use.

progress

integer. If positive, make progress reports (whenever a vector is centered, but not more often than every progress minutes).

accel

integer. Set to 1 if Gearhart-Koshy acceleration should be done.

randfact

logical. Should the order of the factors be randomized? This may improve convergence.

means

logical. Should the means instead of the demeaned matrix be returned? Setting means=TRUE will return mtx - demeanlist(mtx,...), but without the extra copy.

weights

numeric. For weighted demeaning.

scale

logical. Specify scaling for weighted demeaning.

na.rm

logical which indicates what should happen when the data contain NAs. If TRUE, rows in the input mtx are removed prior to centering. If FALSE, they are kept, leading to entire groups becoming NA in the output.

attrs

list. List of attributes which should be attached to the output. Used internally.

Value

If mtx is a matrix, a matrix of the same shape, possibly with column icpt deleted.

If mtx is a list of vectors and matrices, a list of the same length is returned, with the same vector and matrix-pattern, but the matrices have the column icpt deleted.

If mtx is a 'data.frame', a 'data.frame' with the same names is returned; the icpt argument is ignored.

If na.rm is specified, the return value has an attribute 'na.rm' with a vector of row numbers which has been removed. In case the input is a matrix or list, the same rows are removed from all of them. Note that removing NAs incurs a copy of the input, so if memory usage is an issue and many runs are done, one might consider removing NAs from the data set entirely.

Details

For each column y in mtx, the equivalent of the following centering is performed, with cy as the result.

  
cy <- y; oldy <- y-1
while(sqrt(sum((cy-oldy)**2)) >= eps) {
 oldy <- cy
 for(f in fl) cy <- cy - ave(cy,f)
}

Each factor in fl may contain an attribute 'x' which is a numeric vector of the same length as the factor. The centering is then not done on the means of each group, but on the projection onto the covariate in each group. That is, with a covariate x and a factor f, it is like projecting out the interaction x:f. The 'x' attribute can also be a matrix of column vectors, in this case it can be beneficial to orthogonalize the columns, either with a stabilized Gram-Schmidt method, or with the simple method x %*% solve(chol(crossprod(x))).

The weights argument is used if a weighted projection is computed. If \(W\) is the diagonal matrix with weights on the diagonal, demeanlist computes \(W^{-1} M_{WD} W x\) where \(x\) is the input vector, \(D\) is the matrix of dummies from fl and \(M_{WD}\) is the projection on the orthogonal complement of the range of the matrix \(WD\). It is possible to implement the weighted projection with the 'x' attribute mentioned above, but it is a separate argument for convenience. If scale=FALSE, demeanlist computes \(M_{WD} x\) without any \(W\) scaling. If length(scale) > 1, then scale[1] specifies whether the input should be scaled by \(W\), and scale[2] specifies whether the output should be scaled by \(W^{-1}\). This is just a convenience to save some memory copies in other functions in the package.

Note that for certain large datasets the overhead in felm is large compared to the time spent in demeanlist. If the data are present directly without having to use the formula-interface to felm for transformations etc, it is possible to run demeanlist directly on a matrix or "data.frame" and do the OLS "manually", e.g. with something like cx <- demeanlist(x,...); beta <- solve(crossprod(cx), crossprod(cx,y))

In some applications it is known that a single centering iteration is sufficient. In particular, if length(fl)==1 and there is no interaction attribute x. In this case the centering algorithm is terminated after the first iteration. There may be other cases, e.g. if there is a single factor with a matrix x with orthogonal columns. If you have such prior knowledge, it is possible to force termination after the first iteration by adding an attribute attr(fl, 'oneiter') <- TRUE. Convergence will be reached in the second iteration anyway, but you save one iteration, i.e. you double the speed.

Examples

Run this code

# NOT RUN {
oldopts <- options(lfe.threads=1)
## create a matrix
mtx <- data.frame(matrix(rnorm(999),ncol=3))
# a list of factors
rgb <- c('red','green','blue')
fl <- replicate(4, factor(sample(rgb,nrow(mtx),replace=TRUE)), simplify=FALSE)
names(fl) <- paste('g',seq_along(fl),sep='')
# centre on all means
mtx0 <- demeanlist(mtx,fl)
head(data.frame(mtx0,fl))
# verify that the group means for the columns are zero
lapply(fl, function(f) apply(mtx0,2,tapply,f,mean))
options(oldopts)
# }

Run the code above in your browser using DataLab