GRP performs fast, ordered and unordered, groupings of vectors and data frames (or lists of vectors) using radixorderv or group. The output is a list-like object of class 'GRP' which can be printed, plotted and used as an efficient input to all of collapse's fast statistical and transformation functions / operators, as well as to collap, BY and TRA.
fgroup_by is similar to dplyr::group_by but faster. It creates a grouped data frame with a 'GRP' object attached, for faster dplyr-like programming with collapse's fast functions.
There are also several conversion methods to convert to and from 'GRP' objects. Notable among these is GRP.grouped_df, which returns a 'GRP' object from a grouped data frame created with dplyr::group_by or fgroup_by, and the duo GRP.factor and as_factor_GRP.
gsplit efficiently splits a vector based on a grouping object.
GRP(X, …)
# S3 method for default
GRP(X, by = NULL, sort = TRUE, decreasing = FALSE, na.last = TRUE,
return.groups = TRUE, return.order = sort, method = "auto",
call = TRUE, …)
# S3 method for factor
GRP(X, …, group.sizes = TRUE, drop = FALSE, return.groups = TRUE,
call = TRUE)
# S3 method for qG
GRP(X, …, group.sizes = TRUE, return.groups = TRUE, call = TRUE)
# S3 method for pseries
GRP(X, effect = 1L, …, group.sizes = TRUE, return.groups = TRUE,
call = TRUE)
# S3 method for pdata.frame
GRP(X, effect = 1L, …, group.sizes = TRUE, return.groups = TRUE,
call = TRUE)
# S3 method for grouped_df
GRP(X, …, return.groups = TRUE, call = TRUE)
# Identify, get length, group names, and convert GRP object to factor
is_GRP(x)
# S3 method for GRP
length(x)
GRPnames(x, force.char = TRUE)
as_factor_GRP(x, ordered = FALSE)
# Efficiently split a vector using a grouping object
gsplit(x, g, use.g.names = FALSE, ...)
# Fast, class-agnostic version of dplyr::group_by for use with fast functions, see details
fgroup_by(X, …, sort = TRUE, decreasing = FALSE, na.last = TRUE,
return.order = sort, method = "auto")
# Shortcut for fgroup_by
gby(X, …, sort = TRUE, decreasing = FALSE, na.last = TRUE,
return.order = sort, method = "auto")
# Get grouping columns from a grouped data frame created with dplyr::group_by or fgroup_by
fgroup_vars(X, return = "data")
# Ungroup grouped data frame created with dplyr::group_by or fgroup_by
fungroup(X, …)
# S3 method for GRP
print(x, n = 6, …)
# S3 method for GRP
plot(x, breaks = "auto", type = "s", horizontal = FALSE, …)
X: a vector, list of columns or data frame (default method), or a classed object (conversion / extractor methods).
x, g: a 'GRP' object. For gsplit, x can be a vector of any type, or NULL to return the integer indices of the groups.
by: if X is a data frame or list, by can indicate columns to use for the grouping (by default all columns are used). Columns must be passed using a vector of column names, indices, or a one-sided formula, i.e. ~ col1 + col2.
sort: logical. If FALSE, groups are not ordered but simply grouped in the order of first appearance of unique elements / rows. This often provides a performance gain if the data was not sorted beforehand. See also method.
ordered: logical. TRUE adds a class 'ordered', i.e. generates an ordered factor.
decreasing: logical. Should the sort order be increasing or decreasing? Can be a vector of length equal to the number of arguments in X / by (argument passed to radixorderv).
na.last: logical. If missing values are encountered in the grouping vector / columns, assign them to the last group (argument passed to radixorderv).
return.groups: logical. Include the unique groups in the created GRP object.
return.order: logical. Include the output from radixorderv (or group) in the created GRP object. This brings performance improvements in gsplit if sort = TRUE (and thus also benefits grouped execution of base R functions), but has a memory cost because it makes the object larger.
method: character. The algorithm to use for grouping: either "radix", "hash" or "auto". "auto" will choose "radix" when sort = TRUE, yielding ordered grouping via radixorderv, and "hash"-based grouping in first-appearance order via group otherwise. It is possible to set method = "radix" and sort = FALSE, which will group character data in first-appearance order but sort numeric data (a good hybrid option). method = "hash" currently does not support any sorting, so setting sort = TRUE will simply be ignored.
group.sizes: logical. TRUE tabulates factor levels using tabulate to create a vector of group sizes; FALSE leaves that slot empty when converting from factors.
drop: logical. TRUE efficiently drops unused factor levels beforehand using fdroplevels.
call: logical. TRUE calls match.call and saves it in the final slot of the GRP object.
force.char: logical. Always output group names as a character vector, even if a single numeric vector was passed to GRP.default.
effect: plm methods: select which panel identifier should be used as the grouping variable. 1L takes the first variable in the plm::index, 2L the second, etc. Identifiers can also be passed as a character string. More than one variable can be supplied.
return: an integer or string specifying what fgroup_vars should return. The options are:
Int. | String | Description
1 | "data" | full grouping columns (default)
2 | "unique" | unique rows of grouping columns
3 | "names" | names of grouping columns
4 | "indices" | integer indices of grouping columns
5 | "named_indices" | named integer indices of grouping columns
6 | "logical" | logical selection vector of grouping columns
7 | "named_logical" | named logical selection vector of grouping columns
use.g.names: logical. TRUE returns a named list, like split. FALSE is slightly more efficient.
n: integer. Number of groups to print out.
breaks: integer. Number of breaks in the histogram of group sizes.
type: line type for the plot.
horizontal: logical. TRUE arranges plots next to each other, instead of above each other.
…: for fgroup_by: unquoted comma-separated column names, sequences of columns, expressions involving columns, and column names, indices, logical vectors or selector functions. See Examples. For gsplit: further arguments passed to GRP (if g is not already a 'GRP' object).
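The return options of fgroup_vars listed above can be compared side by side. The following is a small sketch, assuming the collapse package is installed and using the built-in mtcars data; the object name gdf is only illustrative.

```r
library(collapse)

# Group mtcars by two columns; fgroup_by attaches a 'GRP' object to the data frame
gdf <- fgroup_by(mtcars, cyl, am)

fgroup_vars(gdf, "names")     # names of the grouping columns
fgroup_vars(gdf, "indices")   # their integer positions among the columns of mtcars
fgroup_vars(gdf, "logical")   # logical selection vector over all columns
str(fgroup_vars(gdf, "unique")) # unique combinations of the grouping columns
```

The string and integer forms are interchangeable according to the table, so fgroup_vars(gdf, 3) should give the same result as fgroup_vars(gdf, "names").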
A list-like object of class 'GRP' containing information about the number of groups, the observations (rows) belonging to each group, the size of each group, the unique group names / definitions, whether the groups are ordered or not and the ordering vector used to perform the ordering. The object is structured as follows:
List-index | Element-name | Content type | Content description
[[1]] | N.groups | integer(1) | Number of groups
[[2]] | group.id | integer(NROW(X)) | An integer group identifier
[[3]] | group.sizes | integer(N.groups) | Vector of group sizes
[[4]] | groups | unique(X) or NULL | Unique groups (same format as input, except for fgroup_by which uses a plain list, sorted if sort = TRUE), or NULL if return.groups = FALSE
[[5]] | group.vars | character | The names of the grouping variables
[[6]] | ordered | logical(2) | [1] TRUE if sort = TRUE, [2] TRUE if X already sorted
[[7]] | order | integer(NROW(X)) or integer(0), with attributes, or NULL | Ordering vector from radixorderv or group (with "starts" attribute), or NULL if return.order = FALSE
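The elements in the table can be inspected by name on any 'GRP' object. A short sketch, assuming collapse is installed and using the built-in mtcars data:

```r
library(collapse)

g <- GRP(mtcars, ~ cyl + am)   # ordered grouping by two columns

g$N.groups         # number of groups
head(g$group.id)   # integer group id for the first rows of mtcars
g$group.sizes      # number of rows in each group
g$groups           # unique combinations of cyl and am
g$group.vars       # names of the grouping variables
g$ordered          # logical(2), as described in the table
```

The group sizes always sum to the number of rows of the input, which is a quick sanity check when building grouping objects programmatically.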
GRP is a central function in the collapse package because it provides the key inputs for easy and efficient groupwise programming at the C/C++ level: (1) the number of groups, (2) an integer group-id indicating which values / rows belong to which group, and (3) the size of each group. Given this information, collapse's Fast Statistical Functions pre-allocate intermediate and result vectors of the right sizes and (in most cases) perform grouped statistical computations in a single pass through the data.
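To illustrate why the number of groups and the integer group-id are sufficient for a single-pass computation, here is a plain R sketch of a grouped sum. The function grouped_sum is a hypothetical helper written for this illustration; it is not part of collapse and does not reflect its actual C/C++ implementation.

```r
# Hypothetical helper (not part of collapse): grouped sum in one pass,
# given the number of groups and an integer group id for each element.
grouped_sum <- function(x, n_groups, group_id) {
  out <- numeric(n_groups)        # pre-allocate the result using the group count
  for (i in seq_along(x)) {       # single pass through the data
    out[group_id[i]] <- out[group_id[i]] + x[i]
  }
  out
}

x  <- c(10, 20, 30, 40, 50)
id <- c(1L, 2L, 1L, 3L, 2L)       # plays the role of the group.id element
grouped_sum(x, n_groups = 3L, group_id = id)   # 40 70 40
```

Because the result vector is allocated once and each value is visited exactly once, no sorting or splitting of the data is needed.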
The sorting and ordering functionality for GRP only affects (2), that is, groups receive different integer ids depending on whether the groups are sorted (sort = TRUE), and in which order (argument decreasing). This in turn changes the order of values / rows in the output of collapse functions.
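The effect of sort on the group ids, and hence on the output order, can be seen on a small character vector. This is a sketch assuming collapse is installed; the variable names are illustrative.

```r
library(collapse)

x <- c("b", "a", "b", "c", "a")
v <- c(1, 2, 3, 4, 5)

g_sorted   <- GRP(x)                # sort = TRUE: ids follow the sorted order a, b, c
g_unsorted <- GRP(x, sort = FALSE)  # ids follow first appearance: b, a, c

g_sorted$group.id
g_unsorted$group.id

fsum(v, g_sorted)     # sums reported in sorted group order
fsum(v, g_unsorted)   # same sums, reported in first-appearance order
```

The grouped sums are identical in both cases; only the position of each group in the result differs.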
Next to GRP, there is the function fgroup_by as a significantly faster alternative to dplyr::group_by. It creates a grouped data frame by attaching a 'GRP' object to a data frame. collapse functions with a grouped_df method applied to that data frame will yield grouped computations. Note that fgroup_by can only be used in combination with collapse functions, not with dplyr::summarize or dplyr::mutate (the grouping object and method of computing results is different). The converse is not true: you can group data with dplyr::group_by and then apply collapse functions.
Note also that fgroup_by is class-agnostic, i.e. the classes of the data frame or list passed are preserved, and all standard methods (like subsetting with `[` or print methods) apply to the grouped object. Apart from the class 'grouped_df', which is added behind any classes the object might inherit (apart from 'data.frame'), a class 'GRP_df' is added in front. This class has print and subset (`[`) methods. Both first call the corresponding method for the object and then print / attach the grouping information. print.GRP_df prints one line below the object indicating the grouping variables, followed, in square brackets, by some statistics on the group sizes: [N | Mean (SD) Min-Max]. The mean is rounded to a whole number and the standard deviation (SD) to one digit. Minimum and maximum are only displayed if the SD is non-zero.
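The class handling described above can be checked directly. A minimal sketch, assuming collapse is installed; it relies on the grouping object being stored in the 'groups' attribute, as stated for GRP.grouped_df.

```r
library(collapse)

gdf <- fgroup_by(mtcars, cyl)

class(gdf)                    # 'GRP_df' in front, 'grouped_df' added as well
is_GRP(attr(gdf, "groups"))   # the attached grouping object is a 'GRP' object

gdf                           # data followed by one line describing the grouping

class(fungroup(gdf))          # grouping classes removed again
```

Since the original classes are kept, the same pattern should apply to other data frame-like inputs, with 'GRP_df' placed in front of their existing classes.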
GRP is an S3 generic function with one default method supporting vector and list input and several conversion methods:
The conversion of factors to 'GRP' objects by GRP.factor involves obtaining the number of groups by calling ng <- fnlevels(f) and then computing the count of each level using tabulate(f, ng). The integer group-id (2) is already given by the factor itself after removing the levels and class attributes and replacing any missing values with ng + 1L. The levels are put in a list and moved to position (4) in the 'GRP' object, which is reserved for the unique groups. Going from a factor to a 'GRP' object thus only requires a tabulation of the levels, whereas creating a factor from a 'GRP' object using as_factor_GRP does not involve any computations, but may involve interacting multiple columns using the paste function to produce unique factor levels (if multiple grouping columns were used).
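The steps described for GRP.factor can be retraced by hand. The first part below is a sketch that mirrors the description (it is not the package's internal code); the second part performs a round trip between a factor and a 'GRP' object. It assumes collapse is installed.

```r
library(collapse)

# Manual retracing of the described steps on a small factor with a missing value
f     <- factor(c("b", "a", NA, "b"))
ng    <- fnlevels(f)         # number of levels
sizes <- tabulate(f, ng)     # count of each level (missing values are not counted)
id    <- unclass(f)          # integer codes of the factor
attributes(id) <- NULL       # drop the levels attribute
id[is.na(id)] <- ng + 1L     # missing values get the id ng + 1L, as described

# Round trip: factor -> 'GRP' object -> factor
g  <- GRP(iris$Species)
f2 <- as_factor_GRP(g)
```

Here sizes and id correspond to the group.sizes and group.id elements discussed earlier.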
The method GRP.grouped_df takes the 'groups' attribute from a grouped data frame and converts it to a 'GRP' object. If the grouped data frame was generated using fgroup_by, all work is done already. If it was created using dplyr::group_by, a C routine is called to efficiently convert the grouping object.
Note: For faster factor generation and a factor-light class 'qG' which avoids the coercion of factor levels to character, also see qF and qG.
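As a brief illustration of the difference mentioned in the note, the sketch below groups a numeric vector with both functions. It assumes collapse is installed and that qG accepts return.groups = TRUE to attach the unique values as a "groups" attribute.

```r
library(collapse)

x <- c(3.5, 1.2, 3.5, 2.0)

f <- qF(x)                          # factor: levels are stored as character
q <- qG(x, return.groups = TRUE)    # 'qG' object: integer ids, unique values kept as numeric

levels(f)
attr(q, "groups")

GRP(q)                              # conversion to a 'GRP' object via the qG method
```

Both objects can be passed as grouping input to collapse functions; the 'qG' form avoids converting the unique values to character.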
radixorder, qF, Fast Grouping and Ordering, Collapse Overview
# NOT RUN {
## default method
GRP(mtcars$cyl)
GRP(mtcars, ~ cyl + vs + am) # Or GRP(mtcars, c("cyl","vs","am")) or GRP(mtcars, c(2,8:9))
g <- GRP(mtcars, ~ cyl + vs + am) # Saving the object
print(g) # Printing it
plot(g) # Plotting it
GRPnames(g) # Retrieve group names
fsum(mtcars, g) # Compute the sum of mtcars, grouped by variables cyl, vs and am
gsplit(mtcars$mpg, g) # Use the object to split a vector
gsplit(NULL, g) # The indices of the groups
## Convert factor to GRP object and vice-versa
GRP(iris$Species)
as_factor_GRP(g)
# }
# NOT RUN {
## dplyr integration
library(dplyr)
mtcars %>% group_by(cyl,vs,am) %>% GRP() # Get GRP object from a dplyr grouped tibble
mtcars %>% group_by(cyl,vs,am) %>% fmean() # Grouped mean using dplyr grouping
mtcars %>% fgroup_by(cyl,vs,am) %>% fmean() # Faster alternative with collapse grouping
mtcars %>% fgroup_by(cyl,vs,am) # Print method for grouped data frame
# }
# NOT RUN {
## Various options for programming and interactive use
library(magrittr)
fgroup_by(GGDC10S, Variable, Decade = floor(Year / 10) * 10) %>% head(3)
fgroup_by(GGDC10S, 1:3, 5) %>% head(3)
fgroup_by(GGDC10S, c("Variable", "Country")) %>% head(3)
fgroup_by(GGDC10S, is.character) %>% head(3)
fgroup_by(GGDC10S, Country:Variable, Year) %>% head(3)
fgroup_by(GGDC10S, Country:Region, Var = Variable, Year) %>% head(3)
# }