collapse (version 1.1.0)

A6-data-transformations: collapse Data Transformations

Description

collapse provides an ensemble of functions that perform common data transformations efficiently and in a user-friendly way:

  • dapply applies functions to rows or columns of matrices and data frames.

  • BY is an S3 generic for Split-Apply-Combine computing and can perform aggregation as well as grouped transformations (for aggregation, see also collap and Fast Statistical Functions).

  • TRA is an S3 generic to efficiently perform (groupwise) replacement and sweeping out of statistics. Supported operations are:

    Integer-id  String-id       Description
    1           "replace_fill"  replace and overwrite missing values
    2           "replace"       replace but preserve missing values
    3           "-"             subtract
    4           "-+"            subtract group-statistics but add group-frequency weighted average of group statistics
    5           "/"             divide
    6           "%"             compute percentages
    7           "+"             add
    8           "*"             multiply
    9           "%%"            modulus

    All of collapse's Fast Statistical Functions have a built-in TRA argument for faster access (i.e. you can compute (groupwise) statistics and use them to transform your data with a single function call).

  • fscale/STD is an S3 generic to perform (groupwise and / or weighted) scaling / standardizing of data and is orders of magnitude faster than base::scale.

  • fwithin/W is an S3 generic to efficiently perform (groupwise and / or weighted) within-transformations / demeaning / centering of data. Similarly fbetween/B computes (groupwise and / or weighted) between-transformations / averages.

  • fHDwithin/HDW, shorthand for 'higher-dimensional within transform', is an S3 generic to efficiently center data on multiple groups and partial-out linear models (possibly involving many levels of fixed effects and interactions). In other words, fHDwithin/HDW efficiently computes residuals from (potentially complex) linear models. Similarly fHDbetween/HDB, shorthand for 'higher-dimensional between transformation', computes the corresponding means or fitted values.

  • fFtest is a fast implementation of the R-squared-based F-test for testing exclusion restrictions on linear models that may involve multiple large factors (fixed effects). Internally it uses fHDwithin to project out factors while counting the degrees of freedom.

  • flag/L/F, fdiff/D and fgrowth/G are S3 generics to compute sequences of lags / leads and suitably lagged and iterated differences and growth rates on time-series and panel data. See Time-Series and Panel-Series for details.

  • STD, W, B, HDW, HDB, L, D and G are parsimonious wrappers around the f- functions above representing the corresponding transformation 'operators'. They have additional capabilities when applied to data-frames (i.e. variable selection, formula input, auto-renaming and id-variable preservation), and are easier to employ in regression formulas, but are otherwise identical in functionality.
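
The grouped transformations listed above can be sketched on a small example. This is a minimal illustration, not taken from the package documentation: the vector x, the factor g, and all result names are made up, and the calls rely only on the leading arguments (data first, then grouping) as I understand the collapse interface.

```r
library(collapse)

# Toy data (hypothetical): one numeric vector and a grouping factor
x <- c(1, 2, 3, 10, 20, 30)
g <- factor(c("a", "a", "a", "b", "b", "b"))

# Group means, one value per group
m <- fmean(x, g)

# Sweep the group means out of x with TRA; "-" (integer id 3) subtracts
centered_tra <- TRA(x, m, "-", g)

# The same transformation in a single call via the built-in TRA argument
centered_fmean <- fmean(x, g, TRA = "-")

# fwithin centers by group; fbetween returns the group means
# expanded to the length of x
centered_fw <- fwithin(x, g)
averaged_fb <- fbetween(x, g)

# fscale standardizes within groups (mean 0, standard deviation 1 per group)
scaled <- fscale(x, g)

# BY applies an arbitrary function by group (here base R's median)
medians <- BY(x, g, median)

# dapply applies a function to each column (the default margin) of a data frame
logged <- dapply(mtcars[1:3], log)
```

In this sketch, TRA(x, m, "-", g), fmean(x, g, TRA = "-") and fwithin(x, g) should all yield the same group-centered values, which is the sense in which the TRA argument offers a shortcut.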

Table of Functions

Function / S3 Generic  Methods                                                        Description
dapply                 No methods, works with matrices and data frames                apply functions to rows or columns
BY                     default, matrix, data.frame, grouped_df                        Split-Apply-Combine computing
TRA                    default, matrix, data.frame, grouped_df                        replace and sweep out statistics
fscale/STD             default, matrix, data.frame, pseries, pdata.frame, grouped_df  scale / standardize data
fwithin/W              default, matrix, data.frame, pseries, pdata.frame, grouped_df  demean / center data
fbetween/B             default, matrix, data.frame, pseries, pdata.frame, grouped_df  compute means / average data
fHDwithin/HDW          default, matrix, data.frame, pseries, pdata.frame              high-dimensional centering and lm residuals
fHDbetween/HDB         default, matrix, data.frame, pseries, pdata.frame              high-dimensional averages and lm fitted values
fFtest                 No methods; standalone test, data must be supplied             fast F-test of exclusion restrictions on linear models (involving factors)
flag/L/F               default, matrix, data.frame, pseries, pdata.frame, grouped_df  (sequences of) lags / leads
fdiff/D                default, matrix, data.frame, pseries, pdata.frame, grouped_df  (sequences of lagged/leaded and iterated) differences
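
The lag and difference rows that close the table can be illustrated with the default (vector) methods. This is a hedged sketch: the panel below is invented, the data are assumed to be already sorted by unit and time (so no time variable is passed), and the argument order (data, number of lags, then grouping) reflects my reading of the interface rather than a quoted signature.

```r
library(collapse)

# Hypothetical panel: two units observed over three periods, sorted by id and time
id <- c(1, 1, 1, 2, 2, 2)
y  <- c(5, 7, 10, 100, 90, 95)

# First lag within each unit; the first observation of each unit becomes NA
lag1 <- flag(y, 1, id)

# A negative n gives a lead instead of a lag
lead1 <- flag(y, -1, id)

# First difference within each unit (n = 1 lag, diff = 1 iteration)
d1 <- fdiff(y, 1, 1, id)

# Passing several n values returns a sequence of lags as matrix columns
lags <- flag(y, 0:2, id)
```

Because the grouping vector is supplied, no lag or difference is computed across the boundary between unit 1 and unit 2.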

See Also

Collapse Overview, Fast Statistical Functions, collap, Time-Series and Panel-Series