GRP: Fast Grouping / collapse Grouping Objects

Description

GRP performs fast, ordered and unordered, groupings of vectors and data frames (or lists of vectors) using radixorderv or group. The output is a list-like object of class 'GRP' which can be printed, plotted and used as an efficient input to all of collapse's fast statistical and transformation functions / operators, as well as to collap, BY and TRA.

fgroup_by is similar to dplyr::group_by but faster. It creates a grouped data frame with a 'GRP' object attached - for faster dplyr-like programming with collapse's fast functions.

There are also several conversion methods to convert to and from 'GRP' objects. Notable among these is GRP.grouped_df, which returns a 'GRP' object from a grouped data frame created with dplyr::group_by or fgroup_by, and the duo GRP.factor and as_factor_GRP.

gsplit efficiently splits a vector based on a grouping object.

Usage

GRP(X, …)
# S3 method for default
GRP(X, by = NULL, sort = TRUE, decreasing = FALSE, na.last = TRUE,
    return.groups = TRUE, return.order = sort, method = "auto",
    call = TRUE, …)
# S3 method for factor
GRP(X, …, group.sizes = TRUE, drop = FALSE, return.groups = TRUE,
    call = TRUE)
# S3 method for qG
GRP(X, …, group.sizes = TRUE, return.groups = TRUE, call = TRUE)
# S3 method for pseries
GRP(X, effect = 1L, …, group.sizes = TRUE, return.groups = TRUE,
    call = TRUE)
# S3 method for pdata.frame
GRP(X, effect = 1L, …, group.sizes = TRUE, return.groups = TRUE,
    call = TRUE)
# S3 method for grouped_df
GRP(X, …, return.groups = TRUE, call = TRUE)
# Identify, get length, group names, and convert GRP object to factor
is_GRP(x)
# S3 method for GRP
length(x)
GRPnames(x, force.char = TRUE)
as_factor_GRP(x, ordered = FALSE)
# Efficiently split a vector using a grouping object
gsplit(x, g, use.g.names = FALSE, ...)
# Fast, class-agnostic version of dplyr::group_by for use with fast functions, see details
fgroup_by(X, …, sort = TRUE, decreasing = FALSE, na.last = TRUE,
          return.order = sort, method = "auto")
# Shortcut for fgroup_by
      gby(X, …, sort = TRUE, decreasing = FALSE, na.last = TRUE,
          return.order = sort, method = "auto")
# Get grouping columns from a grouped data frame created with dplyr::group_by or fgroup_by
fgroup_vars(X, return = "data")
# Ungroup grouped data frame created with dplyr::group_by or fgroup_by
fungroup(X, …)
# S3 method for GRP
print(x, n = 6, …)
# S3 method for GRP
plot(x, breaks = "auto", type = "s", horizontal = FALSE, …)

Arguments

X

a vector, list of columns or data frame (default method), or a classed object (conversion / extractor methods).

x, g

a 'GRP' object. For gsplit, x can be a vector of any type, or NULL to return the integer indices of the groups.

by

if X is a data frame or list, by can indicate columns to use for the grouping (by default all columns are used). Columns must be passed using a vector of column names, indices, or using a one-sided formula i.e. ~ col1 + col2.

sort

logical. If FALSE, groups are not ordered but simply grouped in the order of first appearance of unique elements / rows. This often provides a performance gain if the data was not sorted beforehand. See also method.

ordered

logical. TRUE adds a class 'ordered' i.e. generates an ordered factor.

decreasing

logical. Should the sort order be increasing or decreasing? Can be a vector of length equal to the number of arguments in X / by (argument passed to radixorderv).

na.last

logical. If missing values are encountered in grouping vector/columns, assign them to the last group (argument passed to radixorderv).

return.groups

logical. Include the unique groups in the created GRP object.

return.order

logical. Include the output from radixorderv (or group) in the created GRP object. This brings performance improvements in gsplit if sort = TRUE (and thus also benefits grouped execution of base R functions), but has a memory cost by making the object larger.

method

character. The algorithm to use for grouping: either "radix", "hash" or "auto". "auto" will choose "radix" when sort = TRUE, yielding ordered grouping via radixorderv, and "hash"-based grouping in first-appearance order via group otherwise. It is possible to set method = "radix" and sort = FALSE, which will group character data in first-appearance order but sort numeric data (a good hybrid option). method = "hash" currently does not support any sorting, so sort = TRUE is simply ignored.

group.sizes

logical. TRUE tabulates factor levels using tabulate to create a vector of group sizes; FALSE leaves that slot empty when converting from factors.

drop

logical. TRUE efficiently drops unused factor levels beforehand using fdroplevels.

call

logical. TRUE calls match.call and saves it in the final slot of the GRP object.

force.char

logical. Always output group names as character vector, even if a single numeric vector was passed to GRP.default.

effect

plm methods: selects which panel identifier should be used as the grouping variable. 1L takes the first variable in the plm::index, 2L the second, and so on; identifiers can also be passed as a character string. More than one variable can be supplied.

return

an integer or string specifying what fgroup_vars should return. The options are:

Int.	String	Description
1	"data"	full grouping columns (default)
2	"unique"	unique rows of grouping columns
3	"names"	names of grouping columns
4	"indices"	integer indices of grouping columns
5	"named_indices"	named integer indices of grouping columns
6	"logical"	logical selection vector of grouping columns
7	"named_logical"	named logical selection vector of grouping columns

use.g.names

logical. TRUE returns a named list, like split. FALSE is slightly more efficient.

n

integer. Number of groups to print out.

breaks

integer. Number of breaks in the histogram of group-sizes.

type

linetype for plot.

horizontal

logical. TRUE arranges plots next to each other, instead of above each other.

…

for fgroup_by: unquoted comma-separated column names, sequences of columns, expressions involving columns, and column names, indices, logical vectors or selector functions. See Examples. For gsplit: further arguments passed to GRP (if g is not already a 'GRP' object).

Value

A list-like object of class 'GRP' containing information about the number of groups, the observations (rows) belonging to each group, the size of each group, the unique group names / definitions, whether the groups are ordered or not and the ordering vector used to perform the ordering. The object is structured as follows:

List-index	Element-name	Content type	Content description
[[1]]	N.groups	`integer(1)`	Number of Groups
[[2]]	group.id	`integer(NROW(X))`	An integer group-identifier
[[3]]	group.sizes	`integer(N.groups)`	Vector of group sizes
[[4]]	groups	`unique(X)` or `NULL`	Unique groups (same format as input, except for `fgroup_by` which uses a plain list, sorted if `sort = TRUE`), or `NULL` if `return.groups = FALSE`
[[5]]	group.vars	`character`	The names of the grouping variables
[[6]]	ordered	`logical(2)`	`[1]- TRUE` if `sort = TRUE`, `[2]- TRUE` if `X` already sorted
[[7]]	order	`integer(NROW(X))` or `integer(0)`, with attributes, or `NULL`	Ordering vector from `radixorderv` or `group` (with `"starts"` attribute) or `NULL` if `return.order = FALSE`

Details

GRP is a central function in the collapse package because it provides the key inputs to facilitate easy and efficient groupwise programming at the C/C++ level: information about (1) the number of groups, (2) an integer group-id indicating which values / rows belong to which group, and (3) the size of each group. Provided with this information, collapse's Fast Statistical Functions pre-allocate intermediate and result vectors of the right sizes and (in most cases) perform grouped statistical computations in a single pass through the data.

The sorting and ordering functionality for GRP only affects (2); that is, groups receive different integer ids depending on whether the groups are sorted (sort = TRUE), and in which order (argument decreasing). This in turn changes the order of values / rows in the output of collapse functions.
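
The effect of sort on the group ids can be illustrated with a small character vector. This is a sketch using made-up data (x and v are hypothetical); it assumes the default method = "auto", so that sort = FALSE groups in order of first appearance.

```r
library(collapse)

x <- c("b", "a", "b", "c", "a")   # illustrative grouping vector
v <- c(1, 2, 3, 4, 5)             # illustrative values to aggregate

g_sorted   <- GRP(x)                 # groups ordered a, b, c
g_unsorted <- GRP(x, sort = FALSE)   # groups in first-appearance order b, a, c

g_sorted$group.id     # "a" is group 1, "b" group 2, "c" group 3
g_unsorted$group.id   # "b" is group 1, "a" group 2, "c" group 3

# Same group sums, but returned in a different order
s_sorted   <- fsum(v, g_sorted)
s_unsorted <- fsum(v, g_unsorted)
s_sorted
s_unsorted
```

Group sizes and group membership are identical in both cases; only the numbering of the groups, and hence the row order of grouped output, differs.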

Next to GRP, the function fgroup_by is a significantly faster alternative to dplyr::group_by. It creates a grouped data frame by attaching a 'GRP' object to a data frame. collapse functions with a grouped_df method applied to that data frame will yield grouped computations. Note that fgroup_by can only be used in combination with collapse functions, not with dplyr::summarize or dplyr::mutate (the grouping object and the method of computing results are different). The converse does not hold: you can group data with dplyr::group_by and then apply collapse functions.

Note also that fgroup_by is class-agnostic, i.e. the classes of the data frame or list passed are preserved, and all standard methods (like subsetting with `[` or print methods) apply to the grouped object. Apart from the class 'grouped_df', which is added behind any classes the object might inherit (apart from 'data.frame'), a class 'GRP_df' is added in front. This class has print and subset (`[`) methods. Both first call the corresponding method for the object and then print / attach the grouping information. print.GRP_df prints one line below the object indicating the grouping variables, followed, in square brackets, by some statistics on the group sizes: [N | Mean (SD) Min-Max]. The mean is rounded to a whole number and the standard deviation (SD) to one digit. Minimum and maximum are only displayed if the SD is non-zero.
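
The class layout described above can be checked directly. The sketch below uses mtcars with cyl and vs as illustrative grouping columns, and relies only on functions listed under Usage.

```r
library(collapse)

gdf <- fgroup_by(mtcars, cyl, vs)

class(gdf)                          # 'GRP_df' comes first, 'grouped_df' is added behind
vars <- fgroup_vars(gdf, "names")   # names of the grouping columns
g    <- GRP(gdf)                    # retrieve the attached 'GRP' object

fmean(gdf)                          # grouped means via the grouped_df method

udf <- fungroup(gdf)                # remove the grouping again
class(udf)
```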

GRP is an S3 generic function with one default method supporting vector and list input and several conversion methods:

The conversion of factors to 'GRP' objects by GRP.factor involves obtaining the number of groups by calling ng <- fnlevels(f) and then computing the count of each level using tabulate(f, ng). The integer group-id (2) is already given by the factor itself after removing the levels and class attributes and replacing any missing values with ng + 1L. The levels are put in a list and moved to position (4) in the 'GRP' object, which is reserved for the unique groups. Going from factor to 'GRP' object thus only requires a tabulation of the levels, whereas creating a factor from a 'GRP' object using as_factor_GRP does not involve any computations, but may involve interacting multiple columns using the paste function to produce unique factor levels (if multiple grouping columns were used).
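
The two steps named above (counting levels and tabulating them) can be reproduced by hand and compared with the result of GRP. This is only a sketch of the idea using iris$Species, not the internal code of GRP.factor.

```r
library(collapse)

f <- iris$Species          # a factor without missing values

# Manual version of the steps described in the text
ng    <- fnlevels(f)       # number of levels
sizes <- tabulate(f, ng)   # count of each level

# Conversion via the factor method
g <- GRP(f)
g$N.groups
g$group.sizes

# And back to a factor
f2 <- as_factor_GRP(g)
levels(f2)
```

Here the manually computed ng and sizes should agree with g$N.groups and g$group.sizes.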

The method GRP.grouped_df takes the 'groups' attribute from a grouped data frame and converts it to a 'GRP' object. If the grouped data frame was generated using fgroup_by, all work is done already. If it was created using dplyr::group_by, a C routine is called to efficiently convert the grouping object.

Note: For faster factor generation and a factor-light class 'qG' which avoids the coercion of factor levels to character also see qF and qG.

Examples

# NOT RUN {
## default method
GRP(mtcars$cyl)
GRP(mtcars, ~ cyl + vs + am)       # Or GRP(mtcars, c("cyl","vs","am")) or GRP(mtcars, c(2,8:9))
g <- GRP(mtcars, ~ cyl + vs + am)  # Saving the object
print(g)                           # Printing it
plot(g)                            # Plotting it
GRPnames(g)                        # Retrieve group names
fsum(mtcars, g)                    # Compute the sum of mtcars, grouped by variables cyl, vs and am
gsplit(mtcars$mpg, g)              # Use the object to split a vector
gsplit(NULL, g)                    # The indices of the groups

## Convert factor to GRP object and vice-versa
GRP(iris$Species)
as_factor_GRP(g)
# }
# NOT RUN {
## dplyr integration
library(dplyr)
mtcars %>% group_by(cyl,vs,am) %>% GRP()    # Get GRP object from a dplyr grouped tibble
mtcars %>% group_by(cyl,vs,am) %>% fmean()  # Grouped mean using dplyr grouping
mtcars %>% fgroup_by(cyl,vs,am) %>% fmean() # Faster alternative with collapse grouping

mtcars %>% fgroup_by(cyl,vs,am)            # Print method for grouped data frame
# }
# NOT RUN {
## Various options for programming and interactive use
library(magrittr)
fgroup_by(GGDC10S, Variable, Decade = floor(Year / 10) * 10) %>% head(3)
fgroup_by(GGDC10S, 1:3, 5) %>% head(3)
fgroup_by(GGDC10S, c("Variable", "Country")) %>% head(3)
fgroup_by(GGDC10S, is.character) %>% head(3)
fgroup_by(GGDC10S, Country:Variable, Year) %>% head(3)
fgroup_by(GGDC10S, Country:Region, Var = Variable, Year) %>% head(3)

# }