fbetween-fwithin-B-W: Fast Between (Averaging) and Within (Centering) Transformations

Description

fbetween and fwithin are S3 generics to efficiently obtain between-transformed (averaged) or within-transformed (demeaned) data. These operations can be performed groupwise and/or weighted. B and W are wrappers around fbetween and fwithin representing the 'between-operator' and the 'within-operator'. B / W provide more flexibility than fbetween / fwithin when applied to data frames (column subsetting, formula input, auto-renaming and preservation of id variables), but are otherwise identical.

(fbetween and fwithin are simple programmer's functions in the style of the Fast Statistical Functions, while B and W are more practical to use in regression formulas or for ad-hoc computations on data frames.)

Usage

fbetween(x, …)
 fwithin(x, …)
       B(x, …)
       W(x, …)

# S3 method for default
fbetween(x, g = NULL, w = NULL, na.rm = TRUE, fill = FALSE, …)
# S3 method for default
fwithin(x, g = NULL, w = NULL, na.rm = TRUE, mean = 0, …)
# S3 method for default
B(x, g = NULL, w = NULL, na.rm = TRUE, fill = FALSE, …)
# S3 method for default
W(x, g = NULL, w = NULL, na.rm = TRUE, mean = 0, …)

# S3 method for matrix
fbetween(x, g = NULL, w = NULL, na.rm = TRUE, fill = FALSE, …)
# S3 method for matrix
fwithin(x, g = NULL, w = NULL, na.rm = TRUE, mean = 0, …)
# S3 method for matrix
B(x, g = NULL, w = NULL, na.rm = TRUE, fill = FALSE, stub = "B.", …)
# S3 method for matrix
W(x, g = NULL, w = NULL, na.rm = TRUE, mean = 0, stub = "W.", …)

# S3 method for data.frame
fbetween(x, g = NULL, w = NULL, na.rm = TRUE, fill = FALSE, …)
# S3 method for data.frame
fwithin(x, g = NULL, w = NULL, na.rm = TRUE, mean = 0, …)
# S3 method for data.frame
B(x, by = NULL, w = NULL, cols = is.numeric, na.rm = TRUE,
  fill = FALSE, stub = "B.", keep.by = TRUE, keep.w = TRUE, …)
# S3 method for data.frame
W(x, by = NULL, w = NULL, cols = is.numeric, na.rm = TRUE,
  mean = 0, stub = "W.", keep.by = TRUE, keep.w = TRUE, …)

# Methods for compatibility with plm:
# S3 method for pseries
fbetween(x, effect = 1L, w = NULL, na.rm = TRUE, fill = FALSE, …)
# S3 method for pseries
fwithin(x, effect = 1L, w = NULL, na.rm = TRUE, mean = 0, …)
# S3 method for pseries
B(x, effect = 1L, w = NULL, na.rm = TRUE, fill = FALSE, …)
# S3 method for pseries
W(x, effect = 1L, w = NULL, na.rm = TRUE, mean = 0, …)

# S3 method for pdata.frame
fbetween(x, effect = 1L, w = NULL, na.rm = TRUE, fill = FALSE, …)
# S3 method for pdata.frame
fwithin(x, effect = 1L, w = NULL, na.rm = TRUE, mean = 0, …)
# S3 method for pdata.frame
B(x, effect = 1L, w = NULL, cols = is.numeric, na.rm = TRUE,
  fill = FALSE, stub = "B.", keep.ids = TRUE, keep.w = TRUE, …)
# S3 method for pdata.frame
W(x, effect = 1L, w = NULL, cols = is.numeric, na.rm = TRUE,
  mean = 0, stub = "W.", keep.ids = TRUE, keep.w = TRUE, …)

# Methods for compatibility with dplyr:
# S3 method for grouped_df
fbetween(x, w = NULL, na.rm = TRUE, fill = FALSE,
         keep.group_vars = TRUE, keep.w = TRUE, …)
# S3 method for grouped_df
fwithin(x, w = NULL, na.rm = TRUE, mean = 0,
        keep.group_vars = TRUE, keep.w = TRUE, …)
# S3 method for grouped_df
B(x, w = NULL, na.rm = TRUE, fill = FALSE,
  stub = "B.", keep.group_vars = TRUE, keep.w = TRUE, …)
# S3 method for grouped_df
W(x, w = NULL, na.rm = TRUE, mean = 0,
  stub = "W.", keep.group_vars = TRUE, keep.w = TRUE, …)

Arguments

x

a numeric vector, matrix, data.frame, panel-series (plm::pseries), panel-data.frame (plm::pdata.frame) or grouped tibble (dplyr::grouped_df).

g

a factor, GRP object, atomic vector (internally converted to factor) or a list of vectors / factors (internally converted to a GRP object) used to group x.

by

B and W data.frame method: Same as g, but also allows one- or two-sided formulas, e.g. ~ group1 or var1 + var2 ~ group1 + group2. See Examples.

w

a numeric vector of (non-negative) weights. B/W data.frame and pdata.frame methods also allow a one-sided formula, e.g. ~ weightcol. The grouped_df (dplyr) method supports lazy-evaluation. See Examples.

cols

data.frame method: Select columns to center/average using a function, column names or indices. Default: All numeric variables. Note: cols is ignored if a two-sided formula is passed to by.

na.rm

logical. Skip missing values in x when computing averages. If na.rm = FALSE and an NA or NaN is encountered, the average for that group will be NA, and all data points belonging to that group will also be NA.

effect

plm methods: Select which panel identifier should be used as grouping variable. 1L means the first variable in the plm::index, 2L the second, etc. If more than one integer is supplied, the corresponding index variables are interacted.

stub

a prefix or stub to rename all transformed columns. FALSE will not rename columns.

fill

option to fbetween/B: Logical. TRUE will overwrite missing values in x with the respective average. By default missing values in x are preserved.

mean

option to fwithin/W: The mean to center on, default is 0, but a different mean can be supplied and will be added to the data after the centering is performed. A special option when performing grouped centering is mean = "overall.mean". In that case the overall mean of the data will be added after subtracting out group means.

keep.by, keep.ids, keep.group_vars

B and W data.frame, pdata.frame and grouped_df methods: Logical. Retain grouping / panel-identifier columns in the output. For data frames this only works if grouping variables were passed in a formula.

keep.w

B and W data.frame, pdata.frame and grouped_df methods: Logical. Retain column containing the weights in the output. Only works if w is passed as formula / lazy-expression.

…

arguments to be passed to or from other methods.

Value

fbetween/B returns x with every element replaced by its (groupwise) mean (xi.). fwithin/W returns x with the (groupwise) mean subtracted from every element (x - xi. or x - xi. + mean or x - xi. + x..). See Details.

Details

Without groups, fbetween/B replaces all data points in x with their mean or weighted mean (if w is supplied). Similarly fwithin/W subtracts the mean from all data points i.e. centers the data on the mean.
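The ungrouped case can be sketched in base R as follows. This only illustrates what the transformations compute and is not the package's implementation; the vectors x and w are made-up values chosen for this sketch.

```r
# Base R sketch of the ungrouped between / within transformations.
# x and w are hypothetical example data.
x <- c(2, 4, 6, 8)
w <- c(1, 1, 1, 5)                      # non-negative weights

# Between: every element replaced by the mean (cf. fbetween(x))
x_between <- rep(mean(x), length(x))

# Within: the mean subtracted from every element (cf. fwithin(x))
x_within <- x - mean(x)

# With weights, the weighted mean takes the place of the simple mean
wmean       <- weighted.mean(x, w)
x_between_w <- rep(wmean, length(x))
x_within_w  <- x - wmean

print(x_between)
print(x_within)
print(x_within_w)
```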

With groups supplied to g, the replacement / centering performed by fbetween/B | fwithin/W becomes groupwise. I like to think of this in terms of panel data: If x is a vector in such a dataset, xit denotes a single data-point belonging to group i in time-period t (t need not be a time-period). Then xi. denotes x, averaged over t. fbetween/B now returns xi. and fwithin/W returns x - xi.. Thus for any data x and any grouping vector g: B(x,g) + W(x,g) = xi. + x - xi. = x. In terms of variance, fbetween/B only retains the variance between group averages, while fwithin/W, by subtracting out group means, only retains the variance within those groups.
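The identity B(x,g) + W(x,g) = x and the split of the variation into a between and a within part can be checked with a small base R sketch. It uses ave() as a stand-in for the groupwise mean; x and g are made-up values, and the sum-of-squares check is a standard result added here for illustration.

```r
# Base R sketch of the groupwise transformations; x and g are hypothetical.
x <- c(1, 3, 5, 10, 20)
g <- c("a", "a", "a", "b", "b")

x_between <- ave(x, g)        # group means xi. (cf. fbetween(x, g))
x_within  <- x - x_between    # x - xi.         (cf. fwithin(x, g))

# Between part plus within part recovers the original data
identity_holds <- isTRUE(all.equal(x_between + x_within, x))

# Total variation splits into between-group and within-group variation
ss_total   <- sum((x - mean(x))^2)
ss_between <- sum((x_between - mean(x))^2)
ss_within  <- sum(x_within^2)
split_holds <- isTRUE(all.equal(ss_total, ss_between + ss_within))

print(identity_holds)
print(split_holds)
```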

The data replacement performed by fbetween/B can keep (default) or overwrite missing values (option fill = TRUE) in x. fwithin/W can center data simply (default), or add back a mean after centering (option mean = value), or add the overall mean in groupwise computations (option mean = "overall.mean"). Let x.. denote the overall mean of x, then fwithin/W with mean = "overall.mean" returns x - xi. + x.. instead of x - xi.. This is useful to get rid of group-differences but preserve the overall level of the data (as simple groupwise centering will set the overall mean of the data to 0, or any other arbitrary value passed to mean). In regression analysis, centering with mean = "overall.mean" will only change the constant term. See Examples.
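A base R sketch of the two options described above, under the assumption that missing values are skipped when averaging (na.rm = TRUE). The data are made up and the code only mimics the described behaviour; it is not the package's implementation.

```r
# Hypothetical data: x has a missing value, y is complete.
g <- c("a", "a", "a", "b", "b")
x <- c(1, 3, NA, 10, 20)
y <- c(1, 3, 5, 10, 20)

# Group means of x, skipping missing values
grp_mean <- ave(x, g, FUN = function(v) mean(v, na.rm = TRUE))

# fill = FALSE (default): missing values in x stay missing
between_keep <- grp_mean
between_keep[is.na(x)] <- NA

# fill = TRUE: missing values are overwritten with the group mean
between_fill <- grp_mean

# Plain groupwise centering sets the overall mean to 0 ...
y_within <- y - ave(y, g)

# ... while mean = "overall.mean" adds x.. back, preserving the level
y_within_overall <- y - ave(y, g) + mean(y)

print(between_keep)
print(between_fill)
print(mean(y_within))
print(mean(y_within_overall))
```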

Examples

# NOT RUN {
## Simple centering and averaging
fbetween(mtcars)
B(mtcars)
fwithin(mtcars)
W(mtcars)
fbetween(mtcars) + fwithin(mtcars) == mtcars # This should be true apart from rounding errors

## Groupwise centering and averaging
fbetween(mtcars, mtcars$cyl)
 fwithin(mtcars, mtcars$cyl)
fbetween(mtcars, mtcars$cyl) + fwithin(mtcars, mtcars$cyl) == mtcars

W(wlddev, ~ iso3c, cols = 9:12)    # Center the 4 series in this dataset by country
cbind(get_vars(wlddev,"iso3c"),    # Same thing done manually using fwithin...
      add_stub(fwithin(get_vars(wlddev,9:12), wlddev$iso3c), "W."))

## Using B() and W() in regressions:

# Several ways of running the same regression with cyl-fixed effects
lm(W(mpg,cyl) ~ W(carb,cyl), data = mtcars)                     # Centering each individually
lm(mpg ~ carb, data = W(mtcars, ~ cyl, stub = FALSE))           # Centering the entire data
lm(mpg ~ carb, data = W(mtcars, ~ cyl, stub = FALSE,            # Here only the intercept changes
                        mean = "overall.mean"))
lm(mpg ~ carb + B(carb,cyl), data = mtcars)                     # Procedure suggested by
# ...Mundlak (1978) - partialling out group averages amounts to the same as demeaning the data

# Now with cyl, vs and am fixed effects
lm(W(mpg,list(cyl,vs,am)) ~ W(carb,list(cyl,vs,am)), data = mtcars)
lm(mpg ~ carb, data = W(mtcars, ~ cyl + vs + am, stub = FALSE))
lm(mpg ~ carb + B(carb,list(cyl,vs,am)), data = mtcars)

# Now with cyl, vs and am fixed effects weighted by hp:
lm(W(mpg,list(cyl,vs,am),hp) ~ W(carb,list(cyl,vs,am),hp), data = mtcars)
lm(mpg ~ carb, data = W(mtcars, ~ cyl + vs + am, ~ hp, stub = FALSE))
lm(mpg ~ carb + B(carb,list(cyl,vs,am),hp), data = mtcars)       # Gives a different coefficient!!

# }
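The claim that the unweighted regressions above share the same slope can be checked without the package. The following base R sketch uses ave() in place of B() and W(); the helper cyl_mean and the data frame d are names introduced only for this illustration.

```r
# Base R check of the cyl-fixed-effects regressions on mtcars.
cyl_mean <- function(v) ave(v, mtcars$cyl)   # hypothetical helper: group means by cyl

d <- data.frame(
  mpg    = mtcars$mpg,
  carb   = mtcars$carb,
  carb_b = cyl_mean(mtcars$carb),                 # between-transformed carb
  mpg_w  = mtcars$mpg  - cyl_mean(mtcars$mpg),    # within-transformed mpg
  carb_w = mtcars$carb - cyl_mean(mtcars$carb)    # within-transformed carb
)

# Slope from demeaned data, from the Mundlak-style regression,
# and from explicit cyl dummies
b_within  <- coef(lm(mpg_w ~ carb_w, data = d))[["carb_w"]]
b_mundlak <- coef(lm(mpg ~ carb + carb_b, data = d))[["carb"]]
b_dummy   <- coef(lm(mpg ~ carb + factor(cyl), data = mtcars))[["carb"]]

# Adding the overall means back only shifts the intercept
d$mpg_o  <- d$mpg_w  + mean(d$mpg)
d$carb_o <- d$carb_w + mean(d$carb)
b_overall <- coef(lm(mpg_o ~ carb_o, data = d))[["carb_o"]]

print(c(b_within, b_mundlak, b_dummy, b_overall))
```

All four slopes should agree up to floating-point error; only the intercepts differ.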
