factor_: A cheaper version of `factor()` along with cheaper utilities

Description

A fast version of factor() using the collapse package.

There are some additional utilities, most of which begin with the prefix 'levels_':

as_factor(): An efficient way to coerce both vectors and factors.
levels_factor(): Returns the levels of a factor, as a factor.
levels_used(): Returns the used levels of a factor.
levels_unused(): Returns the unused levels of a factor.
levels_add(): Adds the specified levels onto the existing levels.
levels_rm(): Removes the specified levels.
levels_add_na(): Adds an explicit NA level.
levels_drop_na(): Drops the NA level.
levels_drop(): Drops unused factor levels.
levels_rename(): Renames levels.
levels_lump(): Returns the top n levels and lumps all others into the same category.
levels_count(): Returns the counts of each level.
levels_reorder(): Reorders the levels of x based on y, using the ordered median values of y for each level.

Usage

factor_(
  x = integer(),
  levels = NULL,
  order = TRUE,
  na_exclude = TRUE,
  ordered = is.ordered(x)
)
as_factor(x)
levels_factor(x)
levels_used(x)
levels_unused(x)
levels_rm(x, levels)
levels_add(x, levels, where = c("last", "first"))
levels_add_na(x, name = NA, where = c("last", "first"))
levels_drop_na(x)
levels_drop(x)
levels_reorder(x, order_by, decreasing = FALSE)
levels_rename(x, ..., .fun = NULL)
levels_lump(
  x,
  n,
  prop,
  other_category = "Other",
  ties = c("min", "average", "first", "last", "random", "max")
)
levels_count(x)

Value

A factor, or a character vector in the case of levels_used() and levels_unused(). levels_count() returns a data frame of counts and proportions for each level.
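
As a quick check of the return types described above, the following sketch inspects a small factor. It assumes the cheapr package is installed; the object name `f` is illustrative only.

```r
library(cheapr)

# Level "c" is declared but never used
f <- factor_(c("a", "b", "a"), levels = c("a", "b", "c"))

levels_factor(f)   # the levels, returned as a factor
levels_used(f)     # character vector of levels that occur in f
levels_unused(f)   # character vector of levels that never occur in f
levels_count(f)    # data frame of counts and proportions per level
```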

Arguments

x: A vector.
levels: Optional factor levels.
order: Should factor levels be sorted? Default is TRUE. It is typically faster to set this to FALSE, in which case the levels are kept in order of first appearance.
na_exclude: Should NA values be excluded from the factor levels? Default is TRUE.
ordered: Should the result be an ordered factor?
where: Where should the new level be placed (the NA level in the case of levels_add_na())? Either "first" or "last".
name: Name of NA level.
order_by: A vector to order the levels of x by using the medians of order_by.
decreasing: Should the reordered levels be in decreasing order? Default is FALSE.
...: Key-value pairs where the key is the new name and the value is the existing name to be replaced. For example, levels_rename(x, new = old) renames the level "old" to "new".
.fun: Renaming function applied to each level.
n: Top n number of levels to calculate.
prop: Top proportion of levels to calculate. This is a proportion of the total unique levels in x.
other_category: Name of 'other' category.
ties: Ties method to use. See ?rank.
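
The arguments above can be illustrated with a short sketch. It assumes the cheapr package is installed and uses small made-up vectors (`f`, `g`, `y`) that are not part of the package. The key-value call to levels_rename() follows the `new = old` form given in the argument description.

```r
library(cheapr)

# order = FALSE keeps levels in order of first appearance
levels(factor_(c("b", "a", "b", "c"), order = FALSE))

# na_exclude = FALSE keeps NA as a level
levels(factor_(c("b", NA, "a"), na_exclude = FALSE))

f <- factor_(c("a", "b", "c"))

# Add a level at the end (the default) or at the start
levels(levels_add(f, "z"))
levels(levels_add(f, "z", where = "first"))

# Remove a level; values that had that level become NA
levels_rm(f, "c")

# Rename with a function applied to every level...
levels(levels_rename(f, .fun = toupper))
# ...or with key-value pairs of the form new = old
levels(levels_rename(f, apple = a))

# Reorder levels by the median of a second vector
g <- factor_(c("a", "b", "c", "a", "b", "c"))
y <- c(10, 1, 5, 12, 2, 6)   # medians: a = 11, b = 1.5, c = 5.5
levels(levels_reorder(g, y))
levels(levels_reorder(g, y, decreasing = TRUE))
```

Setting `decreasing = TRUE` reverses the median ordering, which can be handy when the most prominent group should come first in a plot legend.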

Details

This operates similarly to collapse::qF().
The main difference internally is that collapse::funique() is used and therefore S3 methods can be written for it.
Furthermore, for date-times, factor_() differs from factor() in that it distinguishes all instants in time, whereas factor() distinguishes only calendar times. Using a daylight saving example where the clocks go back:
factor(as.POSIXct(1729984360, tz = "Europe/London") + 3600 *(1:5)) produces 4 levels whereas
factor_(as.POSIXct(1729984360, tz = "Europe/London") + 3600 *(1:5)) produces 5 levels.
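
The comparison above can be run as a small sketch. It assumes cheapr is installed and that the system time zone database includes "Europe/London"; the name `times` is illustrative only.

```r
library(cheapr)

# Five hourly instants spanning the end of British Summer Time (27 Oct 2024)
times <- as.POSIXct(1729984360, tz = "Europe/London") + 3600 * (1:5)

# The local clock time 01:12:40 occurs twice, once in BST and once in GMT
format(times, "%H:%M:%S %Z")

# factor() works from the calendar-time labels, so the repeated hour collapses
nlevels(factor(times))

# factor_() keeps each instant as its own level
nlevels(factor_(times))
```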

levels_lump() is a cheaper version of forcats::fct_lump_n() but returns levels in order of highest frequency to lowest. This can be very useful for plotting.

Examples

library(cheapr)

x <- factor_(sample(letters[sample.int(26, 10)], 100, TRUE), levels = letters)
x
# Used/unused levels

levels_used(x)
levels_unused(x)

# Drop unused levels
levels_drop(x)

# Top 3 letters by frequency
lumped_letters <- levels_lump(x, 3)
levels_count(lumped_letters)

# To remove the "other" category, use `levels_rm()`

levels_count(levels_rm(lumped_letters, "Other"))

# We can use levels_lump to create a generic top n function for non-factors too

get_top_n <- function(x, n){
  f <- levels_lump(factor_(x, order = FALSE), n = n)
  levels_count(f)
}

get_top_n(x, 3)

# A neat way to order the levels of a factor by frequency
# is the following:

levels(levels_lump(x, prop = 1)) # Highest to lowest
levels(levels_lump(x, prop = -1)) # Lowest to highest
