TRA: Transform Data by (Groupwise) Replacing or Sweeping out Statistics

Description

TRA is an S3 generic that efficiently transforms data by either (column-wise) replacing data values with supplied statistics or sweeping the statistics out of the data. TRA supports grouped operations and data frames, and is thus a generalization of sweep.

Usage

TRA(x, STATS, FUN = "-", ...)
# S3 method for default
TRA(x, STATS, FUN = "-", g = NULL, ...)
# S3 method for matrix
TRA(x, STATS, FUN = "-", g = NULL, ...)
# S3 method for data.frame
TRA(x, STATS, FUN = "-", g = NULL, ...)
# S3 method for grouped_df
TRA(x, STATS, FUN = "-", keep.group_vars = TRUE, ...)

Arguments

x

an atomic vector, matrix, data frame or grouped tibble (dplyr::grouped_df).

STATS

a matching set of summary statistics computed on x. If g = NULL (no groups), all methods support an atomic vector of statistics of length NCOL(x). The matrix and data.frame methods also support a 1-row matrix or 1-row data.frame/list, respectively. If groups are supplied to g, STATS needs to be of the same type as x and of appropriate dimensions, such that NCOL(x) == NCOL(STATS) and NROW(STATS) matches the number of groups supplied to g (i.e. the number of levels if g is a factor). The first row of STATS corresponds to the first level of g, the second row to the second level, and so on.

FUN

an integer or character string indicating the operation to perform. There are 10 supported operations:

Int.	String	Description
1	"replace_fill"	replace and overwrite missing values
2	"replace"	replace but preserve missing values
3	"-"	subtract (i.e. center)
4	"-+"	subtract group-statistics but add group-frequency weighted average of group statistics (i.e. center on overall average statistic)
5	"/"	divide (i.e. scale; this also changes the mean, whereas `fscale` can scale and preserve the mean)
6	"%"	compute percentages (i.e. divide and multiply by 100)
7	"+"	add
8	"*"	multiply
9	"%%"	modulus (i.e. remainder from division by `STATS`)
10	"-%%"	subtract modulus (i.e. make data divisible by `STATS`)

g

a factor, GRP object, atomic vector (internally converted to an ordered factor) or a list of vectors / factors (internally converted to a GRP object) used to group x. The number of groups must match the number of rows of STATS. See STATS and Details.

keep.group_vars

grouped_df method: Logical. FALSE removes grouping variables after computation. In contrast to the other methods, TRA.grouped_df matches column names exactly, thus STATS can be any subset of aggregated columns in x in any order, with or without grouping columns. TRA.grouped_df will transform the columns in x with their aggregated versions matched from STATS (ignoring grouping columns found in x or STATS and columns in x not found in STATS), and return x again. If keep.group_vars = FALSE, x is returned without grouping columns. See Details and Examples.

...

arguments to be passed to or from other methods.
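
To make the operations listed under FUN more concrete, the following base R sketch spells out what a few of them amount to for a single vector. It does not call TRA itself; the variable names are illustrative, and the "-+" case assumes that group frequencies are simply the group sizes in data without missing values.

```r
# Illustrative data with missing values and two groups
x <- c(1, NA, 3, 4, 5, NA)
g <- c(1, 1, 1, 2, 2, 2)
stats <- as.vector(tapply(x, g, mean, na.rm = TRUE))  # one mean per group
s <- stats[g]                                         # statistic matched to each element

replace_fill <- s                        # "replace_fill": missing values are overwritten too
replace_keep <- ifelse(is.na(x), NA, s)  # "replace": missing values stay missing
centered     <- x - s                    # "-"
percent      <- x / s * 100              # "%"

# "-+" on complete data with unequal group sizes (assumption: frequencies = group sizes)
y  <- c(1, 2, 3, 4, 10)
gy <- c(1, 1, 1, 1, 2)
stats_y <- as.vector(tapply(y, gy, mean))
n <- as.vector(table(gy))                             # group frequencies
center_overall <- y - stats_y[gy] + sum(stats_y * n) / sum(n)
```

In this sketch the frequency-weighted average of the group means equals the overall mean of y, so center_overall removes the between-group differences while keeping the data centered on the overall mean.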

Value

x with columns replaced or swept out using STATS, grouped by g.

Details

Without groups (g = NULL), TRA is nothing more than a column-based version of base::sweep, albeit 4 times more efficient on matrices and many times more efficient on data frames. TRA always preserves all attributes of x.
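
As a minimal sketch of the ungrouped case, the following code writes out column-wise sweeping and replacing with base R only. The comparison with TRA at the end is optional and assumes the collapse package is installed; attributes are ignored in that comparison.

```r
m <- as.matrix(iris[1:4])      # numeric matrix, 150 x 4
stats <- colMeans(m)           # one statistic per column

# Sweeping out (FUN = "-"): subtract each column's statistic
centered <- sweep(m, 2, stats, "-")

# Replacing (FUN = "replace"): overwrite each column with its statistic
# (iris has no missing values, so there is nothing to preserve here)
replaced <- m
replaced[] <- rep(stats, each = nrow(m))

# Optional comparison with TRA, only run if collapse is available
if (requireNamespace("collapse", quietly = TRUE)) {
  print(all.equal(collapse::TRA(m, stats, "-"), centered,
                  check.attributes = FALSE))
  print(all.equal(collapse::TRA(m, stats, "replace"), replaced,
                  check.attributes = FALSE))
}
```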

With groups passed to g, TRA expects (and checks for) a set of statistics such that NROW(STATS) equals the number of groups. If this condition is satisfied, TRA will (without further checks) assume that the first row of STATS is the set of statistics computed on the first group of g, the second row on the second group etc. and do groupwise replacing or sweeping out accordingly.

For example, let x = c(1.2, 4.6, 2.5, 9.1, 8.7, 3.3), let g = c(1,3,3,2,1,2) be an integer vector defining 3 groups, and let STATS = fmean(x,g) = c(4.95, 6.20, 3.55). Then out = TRA(x,fmean(x,g),"-",g) = c(-3.75, 1.05, -1.05, 2.90, 3.75, -2.90) (same as fmean(x, g, TRA = "-")) does the equivalent of the following for-loop: for(i in 1:6) out[i] = x[i] - fmean(x,g)[g[i]].
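
The worked example above can be reproduced in base R. This is only a sketch of the semantics (tapply and ave stand in for fmean), not of how TRA is implemented.

```r
x <- c(1.2, 4.6, 2.5, 9.1, 8.7, 3.3)
g <- c(1, 3, 3, 2, 1, 2)

# Group means in sorted group order (1, 2, 3): 4.95, 6.20, 3.55
stats <- as.vector(tapply(x, g, mean))

# Sweep out: each element minus the statistic of its own group
out <- x - stats[g]
out   # -3.75  1.05 -1.05  2.90  3.75 -2.90 (up to floating-point rounding)

# The same result using base::ave, which returns the group mean per element
all.equal(out, x - ave(x, g))
```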

Correct computation requires that g as used in fmean and g passed to TRA are exactly the same vector. Using g = c(1,3,3,2,1,2) for fmean and g = c(3,1,1,2,3,2) for TRA will not give the right result. The safest way of programming with TRA is thus to repeatedly employ the same factor or GRP object for all grouped computations. Atomic vectors passed to g will be converted to ordered factors (see qF) and lists will be converted to ordered GRP objects. This is also done by all Fast Statistical Functions and by default by BY, thus together with these functions, TRA can also safely be used with atomic- or list-groups. Problems may arise if other functions internally convert atomic vectors or lists to groups in a non-sorted way. Note: as.factor conversions are ok as this also involves sorting.
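
The ordering problem described above can be illustrated in base R. In this sketch, f_unsorted is a hypothetical stand-in for a function that numbers groups in order of first appearance rather than in sorted order; the statistics it produces are then matched against the group codes of a sorted factor.

```r
x <- c(1.2, 4.6, 2.5, 9.1, 8.7, 3.3)
g <- c("b", "c", "c", "a", "b", "a")

f_sorted   <- factor(g)                      # levels a, b, c (sorted)
f_unsorted <- factor(g, levels = unique(g))  # levels b, c, a (order of appearance)

stats_sorted   <- as.vector(tapply(x, f_sorted, mean))    # ordered a, b, c
stats_unsorted <- as.vector(tapply(x, f_unsorted, mean))  # ordered b, c, a

# Consistent: statistics and group codes come from the same factor
ok <- x - stats_sorted[as.integer(f_sorted)]

# Inconsistent: statistics ordered b, c, a but indexed with codes for a, b, c
bad <- x - stats_unsorted[as.integer(f_sorted)]

tapply(ok, g, mean)   # numerically zero: every group is centered
tapply(bad, g, mean)  # not zero: statistics were matched to the wrong groups
```

No error is raised in the inconsistent case because the number of statistics still equals the number of groups, which is why reusing one factor or GRP object for all grouped computations is the safer pattern.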

If x is a grouped tibble (grouped_df), TRA matches the columns of x and STATS and also checks for grouping columns names(attr(x, "groups")) in x and STATS. TRA.grouped_df will then only transform those columns in x for which matching counterparts were found in STATS, exempting grouping columns, and returns x again (with columns in the same order). If keep.group_vars = FALSE, the grouping columns are dropped after computation; however, the "groups" attribute is not dropped (it can be removed using dplyr::ungroup()).

Examples


# NOT RUN {
v <- iris$Sepal.Length          # A numeric vector
f <- iris$Species               # A factor
dat <- num_vars(iris)           # Numeric columns
m <- qM(dat)                    # Matrix of numeric data

head(TRA(v, fmean(v)))                # Simple centering [same as fmean(v, TRA = "-") or W(v)]
head(TRA(m, fmean(m)))                # [same as sweep(m, 2, fmean(m)), fmean(m, TRA = "-") or W(m)]
head(TRA(dat, fmean(dat)))            # [same as fmean(dat, TRA = "-") or W(dat)]
head(TRA(v, fmean(v), "replace"))     # Simple replacing [same as fmean(v, TRA = "replace") or B(v)]
head(TRA(m, fmean(m), "replace"))     # [same as fmean(m, TRA = "replace") or B(m)]
head(TRA(dat, fmean(dat), "replace")) # [same as fmean(dat, TRA = "replace") or B(dat)]
head(TRA(m, fsd(m), "/"))             # Simple scaling... [same as fsd(m, TRA = "/")]...

# Note: All grouped examples also apply for v and dat...
head(TRA(m, fmean(m, f), "-", f))       # Centering [same as fmean(m, f, TRA = "-") or W(m, f)]
head(TRA(m, fmean(m, f), "replace", f)) # Replacing [same as fmean(m, f, TRA = "replace") or B(m, f)]
head(TRA(m, fsd(m, f), "/", f))         # Scaling [same as fsd(m, f, TRA = "/")]

head(TRA(m, fmean(m, f), "-+", f))      # Centering on the overall mean ...
                                        # [same as fmean(m, f, TRA = "-+") or
                                        #           W(m, f, mean = "overall.mean")]
head(TRA(TRA(m, fmean(m, f), "-", f),   # Also the same thing done manually !!
     fmean(m), "+"))

# grouped tibble method
library(dplyr)
iris %>%  group_by(Species) %>% TRA(.,fmean(.))
iris %>%  group_by(Species) %>% fmean(TRA = "-")        # Same thing
iris %>%  group_by(Species) %>% TRA(.,fmean(.)[c(2,4)]) # Only transforming 2 columns
iris %>%  group_by(Species) %>% TRA(.,fmean(.)[c(2,4)], # Dropping species column
                            keep.group_vars = FALSE)
# }
