fscale-STD: Fast (Grouped, Weighted) Scaling and Centering of Matrix-like Objects

Description

fscale is a generic function to efficiently standardize (scale and center) data. STD is a wrapper around fscale representing the 'standardization operator', with more options than fscale when applied to matrices and data frames. Standardization can be simple or groupwise, ordinary or weighted.

Note: For centering without scaling, see fwithin/W. For scaling without centering, use fsd(..., TRA = "/").

Usage

fscale(x, …)
STD(x, …)
# S3 method for default
fscale(x, g = NULL, w = NULL, na.rm = TRUE, stable.algo = TRUE, …)
# S3 method for default
STD(x, g = NULL, w = NULL, na.rm = TRUE, stable.algo = TRUE, …)
# S3 method for matrix
fscale(x, g = NULL, w = NULL, na.rm = TRUE, stable.algo = TRUE, …)
# S3 method for matrix
STD(x, g = NULL, w = NULL, na.rm = TRUE, stable.algo = TRUE,
    stub = "STD.", …)
# S3 method for data.frame
fscale(x, g = NULL, w = NULL, na.rm = TRUE, stable.algo = TRUE, …)
# S3 method for data.frame
STD(x, by = NULL, w = NULL, cols = is.numeric, na.rm = TRUE,
    keep.by = TRUE, keep.w = TRUE, stable.algo = TRUE, stub = "STD.", …)
# Methods for compatibility with plm:
# S3 method for pseries
fscale(x, effect = 1L, w = NULL, na.rm = TRUE, stable.algo = TRUE, …)
# S3 method for pseries
STD(x, effect = 1L, w = NULL, na.rm = TRUE, stable.algo = TRUE, …)
# S3 method for pdata.frame
fscale(x, effect = 1L, w = NULL, na.rm = TRUE, stable.algo = TRUE, …)
# S3 method for pdata.frame
STD(x, effect = 1L, w = NULL, cols = is.numeric, na.rm = TRUE,
    keep.ids = TRUE, keep.w = TRUE, stable.algo = TRUE, stub = "STD.", …)
# Methods for compatibility with dplyr:
# S3 method for grouped_df
fscale(x, w = NULL, na.rm = TRUE, keep.group_vars = TRUE,
       keep.w = TRUE, stable.algo = TRUE, …)
# S3 method for grouped_df
STD(x, w = NULL, na.rm = TRUE, keep.group_vars = TRUE,
    keep.w = TRUE, stable.algo = TRUE, stub = "STD.", …)

Arguments

x

a numeric vector, matrix, data.frame, panel-series (plm::pseries), panel-data.frame (plm::pdata.frame) or grouped tibble (dplyr::grouped_df).

g

a factor, GRP object, atomic vector (internally converted to factor) or a list of vectors / factors (internally converted to a GRP object) used to group x.

by

STD data.frame method: Same as g, but also allows one- or two-sided formulas i.e. ~ group1 or var1 + var2 ~ group1 + group2. See Examples.

cols

data.frame method: Select columns to scale using a function, column names or indices. Default: All numeric variables. Note: cols is ignored if a two-sided formula is passed to by.

w

a numeric vector of (non-negative) weights. STD data.frame and pdata.frame methods also allow a one-sided formula i.e. ~ weightcol. The grouped_df (dplyr) method supports lazy-evaluation. See Examples.

na.rm

logical. Skip missing values in x or w when computing means and sd's.

effect

plm methods: Select which panel identifier should be used as grouping variable. 1L means first variable in the plm::index, 2L the second etc. If more than one integer is supplied, the corresponding index-variables are interacted.

stub

a prefix or stub to rename all transformed columns. FALSE will not rename columns.

stable.algo

logical. FALSE uses a faster but numerically unstable algorithm to compute standard deviations. The default (TRUE) is Welford's numerically stable online algorithm. See Details.

keep.by, keep.ids, keep.group_vars

data.frame, pdata.frame and grouped_df methods: Logical. Retain grouping / panel-identifier columns in the output. For STD.data.frame this only works if grouping variables were passed in a formula.

keep.w

data.frame, pdata.frame and grouped_df methods: Logical. Retain column containing the weights in the output. Only works if w is passed as formula / lazy-expression.

…

arguments to be passed to or from other methods.

Value

x standardized (mean = 0, sd = 1), grouped by g/by, weighted with w. See Details.

Details

If g = NULL, fscale (column-wise) subtracts the mean or weighted mean (if w is supplied) from all data points in x, and then divides this difference by the standard deviation or frequency-weighted standard deviation (if w is supplied). The result is that all columns in x will have mean 0 and standard deviation 1.
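The ungrouped, unweighted case can be reproduced by hand in base R. The following is a minimal sketch only: std_by_hand is a hypothetical helper (not part of collapse), and the cross-check against fscale runs only if collapse happens to be installed.

```r
# Hypothetical helper: (x - mean) / sd for one column, skipping NAs
# (mirrors the na.rm = TRUE default described above)
std_by_hand <- function(x) {
  m <- mean(x, na.rm = TRUE)
  s <- sd(x, na.rm = TRUE)
  (x - m) / s
}

X <- as.matrix(mtcars[, c("mpg", "hp", "wt")])
Z <- apply(X, 2, std_by_hand)   # apply the helper column-wise

# Each column should have mean ~0 and sd ~1, up to floating point error
round(colMeans(Z), 10)
apply(Z, 2, sd)

# Optional cross-check, assuming the collapse package is available
if (requireNamespace("collapse", quietly = TRUE)) {
  print(all.equal(Z, collapse::fscale(X), check.attributes = FALSE))
}
```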

With groups supplied to g, this standardizing becomes groupwise, so that in each group (in each column) the data points will have mean 0 and standard deviation 1.
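As an illustration of the groupwise case, the sketch below standardizes a vector within groups using base R's ave(). It only mimics the result described above and is not how fscale is implemented internally.

```r
x <- mtcars$mpg
g <- mtcars$cyl   # grouping vector with three groups (4, 6 and 8 cylinders)

# Standardize within each group: subtract the group mean, divide by the group sd
z <- ave(x, g, FUN = function(v) (v - mean(v)) / sd(v))

# Within every group the result should have mean ~0 and sd ~1
grp_means <- tapply(z, g, mean)
grp_sds   <- tapply(z, g, sd)
round(grp_means, 10)
grp_sds
```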

If na.rm = FALSE and a NA or NaN is encountered, the mean and sd for that group will be NA, and all data points belonging to that group will also be NA in the output.
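The propagation of missing values under na.rm = FALSE can be mimicked in base R, where mean() and sd() return NA for any input containing NA. This is a sketch of the described behavior on toy data, not a call to fscale itself.

```r
x <- c(1, 2, NA, 4, 5, 6)
g <- c("a", "a", "a", "b", "b", "b")

# Group "a" contains an NA, so its mean and sd are NA and the whole group becomes NA;
# group "b" is complete and is standardized as usual
z <- ave(x, g, FUN = function(v) (v - mean(v)) / sd(v))
z
```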

If na.rm = TRUE, means and sd's are computed (column-wise) on the available data points, and the weight vector may also contain missing values. In that case, the weighted mean and sd are computed on (column-wise) complete.cases(x, w), and x is scaled using these statistics. Note that fscale will not insert a missing value in x if the weight for that value is missing; rather, that value will be scaled using a weighted mean and standard deviation computed without it. (The intention is that a few (randomly) missing weights shouldn't break the computation when na.rm = TRUE, but this is not meant for weight vectors with many missing values. If you don't like this behavior, prepare your data using x[is.na(w), ] <- NA, or impute your weight vector for non-missing x.)
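The handling of missing weights can be sketched by hand for a single vector. The toy data are made up, and the frequency-weighted sd formula with a (sum(w) - 1) denominator is an assumption of this sketch; see fsd for the definition actually used.

```r
x <- c(2, 4, 6, 8, 10)
w <- c(1, 2, NA, 2, 1)   # the third weight is missing

ok <- complete.cases(x, w)               # rows where both x and w are observed
wm <- sum(w[ok] * x[ok]) / sum(w[ok])    # weighted mean on complete cases
# Frequency-weighted sd (assumed formula, see lead-in)
wsd <- sqrt(sum(w[ok] * (x[ok] - wm)^2) / (sum(w[ok]) - 1))

# All non-missing x are scaled, including x[3], whose weight is missing
z <- (x - wm) / wsd
z

# To get NA for such observations instead, blank them out beforehand
x_strict <- x
x_strict[is.na(w)] <- NA
```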

By default, means and standard deviations are computed using Welford's numerically stable online algorithm. If stable.algo = FALSE, a faster but numerically unstable algorithm is used. See fsd for more details regarding the algorithms.
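To illustrate the trade-off, here is a plain R sketch of Welford's online algorithm next to a naive one-pass formula based on sum(x) and sum(x^2). Both welford_sd and naive_sd are hypothetical helpers written for this illustration; they are not collapse's compiled implementation, and the naive formula is only assumed to be representative of a fast but unstable method.

```r
# Welford's online algorithm: one pass, updating the running mean (m)
# and the running sum of squared deviations (m2)
welford_sd <- function(x) {
  n <- 0; m <- 0; m2 <- 0
  for (xi in x) {
    n  <- n + 1
    d  <- xi - m
    m  <- m + d / n
    m2 <- m2 + d * (xi - m)
  }
  sqrt(m2 / (n - 1))
}

# Naive one-pass formula: subtracts two very large, nearly equal numbers,
# so it is prone to catastrophic cancellation (and may even yield NaN)
naive_sd <- function(x) {
  n <- length(x)
  sqrt((sum(x^2) - sum(x)^2 / n) / (n - 1))
}

x <- c(4, 7, 13, 16) + 1e9   # small spread on top of a large offset
c(welford = welford_sd(x), naive = suppressWarnings(naive_sd(x)), base = sd(x))
```

On data like this, the Welford result should agree with sd(x), whereas the naive result cannot be relied upon.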

Examples

# NOT RUN {
library(collapse)          # provides fscale, STD, qsu, pwcor, get_vars and wlddev

## Simple Scaling & Centering / Standardizing
fscale(mtcars)             # Doesn't rename columns
STD(mtcars)                # By default adds a prefix
qsu(STD(mtcars))           # See that it works

## Panel-Data
head(fscale(get_vars(wlddev,9:12), wlddev$iso3c))   # Standardizing 4 series within each country
head(STD(wlddev, ~iso3c, cols = 9:12))              # Same thing using STD, id's added
pwcor(fscale(get_vars(wlddev,9:12), wlddev$iso3c))  # Correlating panel-series after standardizing

## Using plm
pwlddev <- plm::pdata.frame(wlddev, index = c("iso3c","year"))
head(STD(pwlddev))                                  # Standardizing all numeric variables by country
head(STD(pwlddev, effect = 2L))                     # Standardizing all numeric variables by year

## Weighted Standardizing
weights = abs(rnorm(nrow(wlddev)))
head(fscale(get_vars(wlddev,9:12), wlddev$iso3c, weights))
head(STD(wlddev, ~iso3c, weights, 9:12))

# Using dplyr
library(dplyr)
wlddev %>% group_by(iso3c) %>% select(PCGDP,LIFEEX) %>% STD
wlddev %>% group_by(iso3c) %>% select(PCGDP,LIFEEX) %>% STD(weights) # weighted standardizing
wlddev %>% group_by(iso3c) %>% select(PCGDP,LIFEEX,ODA) %>% STD(ODA) # weighting by ODA ->
# ..keeps the weight column unless keep.w = FALSE
# }