qF, shorthand for 'quick-factor', implements very fast factor generation from atomic vectors using either radix ordering or index hashing.
qG, shorthand for 'quick-group', generates a kind of 'factor-light': an integer vector without the levels attribute, but with an attribute providing the number of levels. Optionally the levels / groups can be attached, but without converting them to character. Objects have a class 'qG'. A multivariate version is provided by the function group.
finteraction generates a factor by interacting multiple vectors or factors. In that process missing values are always replaced with a level and unused levels are always dropped.
qF(x, ordered = FALSE, na.exclude = TRUE, sort = TRUE, drop = FALSE,
   keep.attr = TRUE, method = "auto")

qG(x, ordered = FALSE, na.exclude = TRUE, sort = TRUE,
   return.groups = FALSE, method = "auto")

is_qG(x)

as_factor_qG(x, ordered = FALSE, na.exclude = TRUE)

finteraction(…, ordered = FALSE, sort = TRUE, method = "auto")
x: an atomic vector, factor or quick-group.
ordered: logical. Adds a class 'ordered'.
na.exclude: logical. TRUE preserves missing values (i.e. no level is generated for NA).
sort: logical. TRUE sorts the levels in ascending order (like factor); FALSE provides the levels in order of first appearance, which can be significantly faster. Note that if a factor is passed, only sort = FALSE takes effect (as factors usually have sorted levels and checking sortedness can be expensive).
drop: logical. If x is a factor, TRUE efficiently drops unused factor levels beforehand using fdroplevels.
keep.attr: logical. If TRUE and x has additional attributes apart from 'levels' and 'class', these are preserved in the conversion to factor.
method: an integer or character string specifying the method of computation:
Int. | String | Description
1 | "auto" | automatic selection: hash if x is character or logical, if sort = FALSE, or if length(x) < 500; otherwise radix.
2 | "radix" | use radix ordering to generate factors. Supports sort = FALSE only for character vectors. See Details.
3 | "hash" | use index hashing to generate factors. See Details.
Note that for finteraction, method = "hash" is always unsorted.
return.groups: logical. TRUE returns the unique elements / groups / levels of x in an attribute called 'groups'. Unlike qF, they are not converted to character.
…: multiple atomic vectors or factors, or a single list of equal-length vectors or factors. See Details.
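The drop and keep.attr arguments can be illustrated with a minimal sketch, assuming the collapse package is installed. The factor f, the vector x and its 'label' attribute are made up for illustration and are not part of the original documentation.

```r
library(collapse)

# Hypothetical factor with an unused level "c"
f <- factor(c("a", "b", "a"), levels = c("a", "b", "c"))
levels(qF(f))               # default drop = FALSE: the unused level is kept
levels(qF(f, drop = TRUE))  # unused level is dropped before conversion

# Hypothetical numeric vector carrying an extra attribute
x <- c(10, 20, 10)
attr(x, "label") <- "illustrative label"
attr(qF(x), "label")                     # kept with the default keep.attr = TRUE
attr(qF(x, keep.attr = FALSE), "label")  # not carried over
```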
qF and finteraction return an (ordered) factor. qG returns an object of class 'qG': an integer grouping vector with an attribute 'N.groups' indicating the number of groups, and, if return.groups = TRUE, an attribute 'groups' containing the vector of unique groups / elements in x corresponding to the integer-id.
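To make the two return types concrete, the following sketch inspects a 'qG' object and a factor built from the same column. It assumes collapse is installed and uses the built-in mtcars data; it only reads the attributes named above.

```r
library(collapse)

g <- qG(mtcars$cyl, return.groups = TRUE)
is_qG(g)             # checks for class 'qG'
attr(g, "N.groups")  # number of groups
attr(g, "groups")    # unique values of x, left as numeric

f <- qF(mtcars$cyl)
is.factor(f)
levels(f)            # factor levels are character
```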
These functions are central to grouping in collapse. Whenever a vector is passed as a grouping argument to a collapse function, such as fmean(mtcars, mtcars$cyl), it is grouped using qF or qG.
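As a small sketch of this behaviour (assuming collapse is installed), passing a plain vector as the grouping argument should give the same group means as grouping up front with qF or qG. The base R tapply call is included only as a reference point.

```r
library(collapse)

# Grouping vector passed directly: grouped internally
m1 <- fmean(mtcars$mpg, mtcars$cyl)

# Grouping done beforehand
m2 <- fmean(mtcars$mpg, qF(mtcars$cyl))
m3 <- fmean(mtcars$mpg, qG(mtcars$cyl))

# Base R reference computation
m0 <- tapply(mtcars$mpg, mtcars$cyl, mean)

all.equal(as.numeric(m1), as.numeric(m0))
```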
qF is a combination of as.factor and factor. Applying it to a vector, i.e. qF(x), gives the same result as as.factor(x). qF(x, ordered = TRUE) generates an ordered factor (same as factor(x, ordered = TRUE)), and qF(x, na.exclude = FALSE) generates a level for missing values (same as factor(x, exclude = NULL)). An important addition is that qF(x, na.exclude = FALSE) also adds a class 'na.included'. This prevents collapse functions from checking for missing values in the factor, and is thus computationally more efficient. Therefore factors used in grouped operations should preferably be generated using qF(x, na.exclude = FALSE). Setting sort = FALSE gathers the levels in first-appearance order (unless method = "radix" and x is numeric, in which case the levels are always sorted). This can provide a speed improvement, particularly for character data.
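The correspondences with base R described here can be checked on a toy vector. This is a sketch assuming collapse is installed; the vectors x and y are made up for illustration.

```r
library(collapse)

x <- c("b", "a", NA, "c", "a")  # made-up data with one missing value

f1 <- qF(x)                      # compare with as.factor(x)
f2 <- qF(x, ordered = TRUE)      # compare with factor(x, ordered = TRUE)
f3 <- qF(x, na.exclude = FALSE)  # compare with factor(x, exclude = NULL)

levels(f1); levels(as.factor(x))
is.ordered(f2)
levels(f3)  # includes a level for NA
class(f3)   # includes 'na.included'

# Levels in order of first appearance
y  <- c("b", "a", "c", "a")
f4 <- qF(y, sort = FALSE)
levels(f4)
```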
There are 3 methods of computation: radix ordering, index hashing, and hashing based on group. Radix ordering is done through combining the functions radixorder and groupid. It is generally faster than index hashing for large numeric data (although there are exceptions). Index hashing is done using Rcpp::sugar::sort_unique and Rcpp::sugar::match. It is generally faster for character data. If sort = FALSE, group is used, which is also very fast.
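The methods differ in how the grouping is computed, not in the factor they return when sort = TRUE. A minimal sketch, assuming collapse is installed and using made-up numeric data:

```r
library(collapse)

set.seed(1)
x <- sample(c(2.5, 1, 10, 7), 1000, replace = TRUE)  # made-up numeric data

f_auto  <- qF(x)                    # method = "auto" picks a method itself
f_radix <- qF(x, method = "radix")  # radix ordering
f_hash  <- qF(x, method = "hash")   # index hashing

# Both explicit methods should agree on levels and codes
identical(levels(f_radix), levels(f_hash))
identical(as.integer(f_radix), as.integer(f_hash))
```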
Regarding speed: In general qF is around 5x faster than as.factor on character data and about 30x faster on numeric data. Automatic method dispatch typically does a good job delivering optimal performance.
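A rough way to check these speed claims on your own machine is sketched below, assuming collapse is installed. The data are simulated, the sizes are arbitrary, and timings will vary by hardware and package version, so no particular ratio is implied.

```r
library(collapse)

set.seed(1)
n  <- 1e5
xc <- sample(as.character(1:1000), n, replace = TRUE)  # simulated character data
xn <- sample(1000, n, replace = TRUE) + 0.5            # simulated numeric data

# Character data
system.time(as.factor(xc))
system.time(qF(xc))

# Numeric data
system.time(as.factor(xn))
system.time(qF(xn))
```

For more reliable measurements, repeat each call several times or use a dedicated benchmarking package.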
qG is first and foremost a programmer's function. It generates a factor-'light' consisting of only an integer grouping vector and an attribute providing the number of groups. It is slightly faster and more memory efficient than GRP for grouping atomic vectors, which is the main reason it exists. The fact that it (optionally) returns the unique groups / levels without converting them to character is an added bonus (this also provides a small performance gain compared to qF). Since v1.7, you can also call a C-level function group directly, which works for multivariate data as well, but does not sort the data and does not preserve missing values.
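A short sketch of how qG, as_factor_qG and group fit together, assuming collapse (v1.7 or later) is installed. The choice of mtcars columns is arbitrary.

```r
library(collapse)

# Quick-group with the unique values attached, then converted to a factor
g <- qG(mtcars$cyl, return.groups = TRUE)
f <- as_factor_qG(g)
levels(f)

# Multivariate grouping on two columns: unsorted group ids
gm <- group(mtcars[c("cyl", "vs")])
attr(gm, "N.groups")

# Base R reference: number of distinct (cyl, vs) combinations
nrow(unique(mtcars[c("cyl", "vs")]))
```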
finteraction is simply a wrapper around as_factor_GRP(GRP.default(X)), where X is replaced by the arguments in '…' combined in a list (so it is not really an interaction function but just a multivariate grouping converted to factor; see GRP for computational details). In general, all vectors, factors, or lists of vectors / factors passed can be interacted. Interactions always create a level for missing values and always drop any unused levels.
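The two guarantees stated here (a level for missing values, no unused levels) can be seen on a toy example. This is a sketch assuming collapse is installed; the inputs a and b are made up.

```r
library(collapse)

a <- factor(c("x", "y", NA, "x"), levels = c("x", "y", "z"))  # "z" is unused
b <- c(1, 2, 2, 1)

f <- finteraction(a, b)
levels(f)             # one level per observed combination, including the NA one
nlevels(f)            # no level involving the unused "z"
anyNA(as.integer(f))  # the missing value has its own level, so no NA codes
```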
group, groupid, GRP, Fast Grouping and Ordering, Collapse Overview
# NOT RUN {
cylF <- qF(mtcars$cyl) # Factor from atomic vector
cylG <- qG(mtcars$cyl) # Quick-group from atomic vector
cylG # See the simple structure of this object
cf <- qF(wlddev$country) # Bigger data
cf2 <- qF(wlddev$country, na.exclude = FALSE) # With na.included class
dat <- num_vars(wlddev)
# }
# NOT RUN {
# cf2 is faster in grouped operations because no missing value check is performed
library(microbenchmark)
microbenchmark(fmax(dat, cf), fmax(dat, cf2))
# }
# NOT RUN {
finteraction(mtcars$cyl, mtcars$vs) # Interacting two variables (can be factors)
head(finteraction(mtcars)) # A more crude example..
# }