collapse
Data Transformations
collapse provides an ensemble of functions to perform common data transformations efficiently and in a user-friendly way:
dapply applies functions to rows or columns of matrices and data frames.
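As a minimal sketch of the row / column distinction, the example below uses the built-in mtcars data. The choice of median, max and log1p as applied functions is purely illustrative, and the behaviour noted in the comments reflects the default arguments as far as I know them.

```r
library(collapse)

# Column-wise (the default, MARGIN = 2): one median per variable
col_medians <- dapply(mtcars, median)

# Row-wise (MARGIN = 1): one maximum per observation
row_max <- dapply(mtcars, max, MARGIN = 1)

# A function returning a full-length vector gives back a data frame
logged <- dapply(mtcars, log1p)
```

Because the applied functions in the first two calls return scalars, the results are simplified to plain vectors; the third call preserves the data frame.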
BY is an S3 generic for Split-Apply-Combine computing and can perform aggregation as well as grouped transformations (for aggregation, see also collap and the Fast Statistical Functions).
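The two uses of BY can be sketched as follows, again on mtcars. The grouping variable and the anonymous share function are illustrative choices, not part of the package.

```r
library(collapse)

g <- factor(mtcars$cyl)

# Aggregation with the default (vector) method: one mean per group
mpg_by_cyl <- BY(mtcars$mpg, g, mean)

# Aggregation with the data.frame method: one row per group
means_by_cyl <- BY(mtcars[c("mpg", "hp")], g, mean)

# Grouped transformation: each car's share of its group's total horsepower
hp_share <- BY(mtcars$hp, g, function(v) v / sum(v))
```

When the applied function returns one value per group the result is an aggregate; when it returns a vector as long as the group, the result has one value per observation.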
TRA is an S3 generic to efficiently perform (groupwise) replacement and sweeping out of statistics.
Supported operations are:
Integer-id | String-id | Description
1 | "replace_fill" | replace and overwrite missing values
2 | "replace" | replace but preserve missing values
3 | "-" | subtract
4 | "-+" | subtract group-statistics but add group-frequency weighted average of group statistics
5 | "/" | divide
6 | "%" | compute percentages
7 | "+" | add
8 | "*" | multiply
9 | "%%" | modulus
All of collapse's Fast Statistical Functions have a built-in TRA argument for faster access (i.e. you can compute (groupwise) statistics and use them to transform your data with a single function call).
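The following sketch contrasts the two-step route (compute statistics, then sweep them out with TRA) with the one-step route through the TRA argument of fmean. It assumes data without missing values, so that "replace_fill" and "replace" would give the same result; the variable names are illustrative.

```r
library(collapse)

x <- mtcars$mpg
g <- factor(mtcars$cyl)

# Two steps: compute group means, then subtract them from the data
centered <- TRA(x, fmean(x, g), "-", g)

# One step: the same operation through the TRA argument of fmean
centered2 <- fmean(x, g, TRA = "-")

# "replace_fill" replaces every observation with its group mean
group_mean <- fmean(x, g, TRA = "replace_fill")

# "-+" subtracts group means but adds back their frequency-weighted average,
# which without missing values is the overall mean
centered_keep_level <- fmean(x, g, TRA = "-+")
```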
fscale/STD is an S3 generic to perform (groupwise and / or weighted) scaling / standardizing of data and is orders of magnitude faster than base::scale.
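A small sketch of overall and groupwise standardizing, with base R expressions included only as a reference for what is being computed:

```r
library(collapse)

x <- mtcars$mpg
g <- factor(mtcars$cyl)

# Overall standardizing: mean 0 and standard deviation 1
z <- fscale(x)
z_ref <- as.numeric(scale(x))  # base R reference

# Groupwise standardizing: mean 0 and standard deviation 1 within each group
z_by_g <- fscale(x, g)
z_by_g_ref <- ave(x, g, FUN = function(v) (v - mean(v)) / sd(v))

# The data.frame method standardizes every column
z_df <- fscale(mtcars)
```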
fwithin/W is an S3 generic to efficiently perform (groupwise and / or weighted) within-transformations / demeaning / centering of data. Similarly, fbetween/B computes (groupwise and / or weighted) between-transformations / averages.
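The within and between transformations split each observation into a group mean and a deviation from it. A minimal sketch, assuming unweighted data without missing values:

```r
library(collapse)

x <- mtcars$mpg
g <- factor(mtcars$cyl)

within_x  <- fwithin(x, g)   # deviations from group means
between_x <- fbetween(x, g)  # group means, repeated for each observation

# The two components add back up to the original data
recombined <- within_x + between_x
```

Without a grouping argument, fwithin(x) simply centers x on its overall mean.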
fHDwithin/HDW, shorthand for 'higher-dimensional within transformation', is an S3 generic to efficiently center data on multiple groups and partial-out linear models (possibly involving many levels of fixed effects and interactions). In other words, fHDwithin/HDW efficiently computes residuals from (potentially complex) linear models. Similarly, fHDbetween/HDB, shorthand for 'higher-dimensional between transformation', computes the corresponding means or fitted values.
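To illustrate the "residuals from a linear model" reading, the sketch below partials out one factor and one continuous covariate and compares the result with lm. It deliberately uses a single factor, on the assumption that centering on several factors may rely on an additional suggested package; the list fl and its element names are illustrative.

```r
library(collapse)

# One factor (cyl) plus one continuous covariate (hp) to partial out
fl <- list(cyl = factor(mtcars$cyl), hp = mtcars$hp)

res <- fHDwithin(mtcars$mpg, fl)   # residuals after partialling out fl
fit <- fHDbetween(mtcars$mpg, fl)  # corresponding fitted values

# Reference model estimated with base R
ref <- lm(mpg ~ factor(cyl) + hp, data = mtcars)
max_abs_diff <- max(abs(as.numeric(res) - as.numeric(resid(ref))))
```

If the two approaches agree, max_abs_diff should be zero up to floating-point error.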
fFtest is a fast implementation of the R-squared based F-test, used to test exclusion restrictions on linear models potentially involving multiple large factors (fixed effects). It internally utilizes fHDwithin to project out factors while counting the degrees of freedom.
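As a hedged illustration, the call below tests whether the cylinder dummies can be excluded from a model of mpg that also controls for hp and wt, and computes the same test with base R for comparison. The argument names y, exc and X follow my understanding of the fFtest signature; the model itself is an arbitrary example.

```r
library(collapse)

# Exclusion restriction: are the cyl dummies jointly zero, given hp and wt?
res <- fFtest(mtcars$mpg, exc = factor(mtcars$cyl), X = mtcars[c("hp", "wt")])
print(res)

# Reference: the same restriction tested with nested lm fits
restricted <- lm(mpg ~ hp + wt, data = mtcars)
full       <- lm(mpg ~ hp + wt + factor(cyl), data = mtcars)
ref_F <- anova(restricted, full)$F[2]
```

The F-statistic reported for the exclusion restriction should agree with ref_F up to numerical precision.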
flag/L/F, fdiff/D and fgrowth/G are S3 generics to compute sequences of lags / leads and suitably lagged and iterated differences and growth rates on time-series and panel data. See Time-Series and Panel-Series for more.
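A short sketch on a toy vector shows the basic calls. The data and the grouping vector id are made up for illustration, and the grouped call assumes observations are already ordered in time within each group.

```r
library(collapse)

x  <- c(10, 12, 15, 11, 18)
id <- c(1, 1, 1, 2, 2)   # hypothetical panel identifier

lag1  <- flag(x)        # first lag
lead1 <- flag(x, -1)    # negative n gives a lead
lags  <- flag(x, -1:1)  # a sequence: one column per lead / lag

d1 <- fdiff(x)          # first difference
d2 <- fdiff(x, 1, 2)    # first difference, iterated twice
gr <- fgrowth(x)        # growth rate in percent

lag1_by_id <- flag(x, 1, id)  # lags do not cross group boundaries
```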
STD, W, B, HDW, HDB, L, D and G are parsimonious wrappers around the f-functions above representing the corresponding transformation 'operators'. They have additional capabilities when applied to data frames (i.e. variable selection, formula input, auto-renaming and id-variable preservation), and are easier to employ in regression formulas, but are otherwise identical in functionality.
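The additional data frame capabilities and the use in regression formulas can be sketched with the W operator. The "W." prefix on transformed columns reflects the default renaming as I understand it, and the regression is an arbitrary example.

```r
library(collapse)

# Formula input: center mpg and hp within cyl; cyl is preserved as an
# id variable and the transformed columns are renamed automatically
w_df <- W(mtcars, mpg + hp ~ cyl)
names(w_df)

# Inside a regression formula: a within (fixed-effects) regression
fe <- lm(W(mpg, cyl) ~ W(hp, cyl), data = mtcars)

# Reference: dummy-variable regression, which gives the same slope on hp
lsdv <- lm(mpg ~ hp + factor(cyl), data = mtcars)
```

The slope estimates coincide because centering both variables on the groups is equivalent to including group dummies; the reported standard errors differ, since lm does not account for the degrees of freedom used by the centering.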
Function / S3 Generic | Methods | Description
dapply | No methods, works with matrices and data frames | apply functions to rows or columns
BY | default, matrix, data.frame, grouped_df | Split-Apply-Combine computing
TRA | default, matrix, data.frame, grouped_df | replace and sweep out statistics
fscale/STD | default, matrix, data.frame, pseries, pdata.frame, grouped_df | scale / standardize data
fwithin/W | default, matrix, data.frame, pseries, pdata.frame, grouped_df | demean / center data
fbetween/B | default, matrix, data.frame, pseries, pdata.frame, grouped_df | compute means / average data
fHDwithin/HDW | default, matrix, data.frame, pseries, pdata.frame | high-dimensional centering and lm residuals
fHDbetween/HDB | default, matrix, data.frame, pseries, pdata.frame | high-dimensional averages and lm fitted values
fFtest | No methods; a standalone test to which data must be supplied | fast F-test of exclusion restrictions on linear models (involving factors)
flag/L/F | default, matrix, data.frame, pseries, pdata.frame, grouped_df | (sequences of) lags / leads
fdiff/D | default, matrix, data.frame, pseries, pdata.frame, grouped_df | (sequences of lagged/leaded and iterated) differences
Collapse Overview, Fast Statistical Functions, collap, Time-Series and Panel-Series