cheapr (version 1.1.0)

factor_: A cheaper version of factor() along with cheaper utilities

Description

A fast version of factor() using the collapse package.

There are some additional utilities, most of which begin with the prefix 'levels_': as_factor() is an efficient way to coerce both vectors and factors; levels_factor() returns the levels of a factor, as a factor; levels_used() returns the used levels of a factor; levels_unused() returns the unused levels of a factor; levels_add() adds the specified levels onto the existing levels; levels_rm() removes the specified levels; levels_add_na() adds an explicit NA level; levels_drop_na() drops the NA level; levels_drop() drops unused factor levels; levels_rename() renames levels; levels_lump() returns the top n levels and lumps all others into the same category; levels_count() returns the counts of each level; and levels_reorder() reorders the levels of x based on order_by, using the ordered median values of order_by for each level.

Usage

factor_(
  x = integer(),
  levels = NULL,
  order = TRUE,
  na_exclude = TRUE,
  ordered = is.ordered(x)
)

as_factor(x)

levels_factor(x)

levels_used(x)

levels_unused(x)

levels_rm(x, levels)

levels_add(x, levels, where = c("last", "first"))

levels_add_na(x, name = NA, where = c("last", "first"))

levels_drop_na(x)

levels_drop(x)

levels_reorder(x, order_by, decreasing = FALSE)

levels_rename(x, ..., .fun = NULL)

levels_lump(
  x,
  n,
  prop,
  other_category = "Other",
  ties = c("min", "average", "first", "last", "random", "max")
)

levels_count(x)

Value

A factor, or a character vector in the case of levels_used() and levels_unused(). levels_count() returns a data frame of counts and proportions for each level.

Arguments

x

A vector.

levels

Optional factor levels.

order

Should factor levels be sorted? Default is TRUE. It is typically faster to set this to FALSE, in which case the levels are kept in order of first appearance.

na_exclude

Should NA values be excluded from the factor levels? Default is TRUE.

ordered

Should the result be an ordered factor?

where

Where should the new levels (or the NA level) be placed? Either "first" or "last".

name

Name of the NA level.

order_by

A vector used to order the levels of x. Levels are sorted by the median of order_by within each level.

decreasing

Should the reordered levels be in decreasing order? Default is FALSE.

...

Key-value pairs where the key is the new name and the value is the existing level name to be replaced. For example levels_rename(x, new = old) replaces the level "old" with the level "new".

.fun

Renaming function applied to each level.

n

Number of top (most frequent) levels to keep.

prop

Top proportion of levels to calculate. This is a proportion of the total unique levels in x.

other_category

Name of 'other' category.

ties

Ties method to use. See ?rank.
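
As a rough sketch of how the renaming and reordering arguments above fit together: the toy vectors f and y below are made up for illustration, and the behaviour noted in the comments follows from the argument descriptions rather than from captured output.

```r
library(cheapr)

# Toy data (hypothetical): three levels, two observations each
f <- factor_(c("a", "a", "b", "b", "c", "c"))
y <- c(5, 6, 1, 2, 3, 4)

# Key-value renaming: the key is the new name, the value is the old one
renamed <- levels_rename(f, alpha = a)
levels(renamed)

# Function-based renaming: .fun is applied to every level
upper <- levels_rename(f, .fun = toupper)
levels(upper)

# Reorder levels by the median of y within each level
# (medians here: a = 5.5, b = 1.5, c = 3.5)
by_median <- levels_reorder(f, order_by = y)
levels(by_median)

# Same idea, largest median first
by_median_desc <- levels_reorder(f, order_by = y, decreasing = TRUE)
levels(by_median_desc)
```

Reordering is expected to change only the level order; the values stored in f stay the same.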

Details

This operates similarly to collapse::qF().
The main difference internally is that collapse::funique() is used and therefore S3 methods can be written for it.
Furthermore, for date-times factor_() differs in that it distinguishes all instants in time, whereas factor() distinguishes calendar (clock) times. Using a daylight saving example where the clocks go back:
factor(as.POSIXct(1729984360, tz = "Europe/London") + 3600 *(1:5)) produces 4 levels whereas
factor_(as.POSIXct(1729984360, tz = "Europe/London") + 3600 *(1:5)) produces 5 levels.
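
The comparison above can be run directly. This sketch only restates the example from the text; the format() call is an addition to show the repeated clock time, and it assumes the "Europe/London" time zone is available on the system.

```r
library(cheapr)

# Five instants one hour apart, spanning the end of British Summer Time in 2024
x <- as.POSIXct(1729984360, tz = "Europe/London") + 3600 * (1:5)

# The same clock time appears twice, once in BST and once in GMT
format(x, usetz = TRUE)

# Base R builds levels from the character representation of the clock times
nlevels(factor(x))   # 4 levels, as stated in the text

# factor_() builds levels from the underlying instants
nlevels(factor_(x))  # 5 levels, as stated in the text
```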

levels_lump() is a cheaper version of forcats::fct_lump_n() but returns levels in order of highest frequency to lowest. This can be very useful for plotting.

Examples

library(cheapr)

x <- factor_(sample(letters[sample.int(26, 10)], 100, TRUE), levels = letters)
x
# Used/unused levels

levels_used(x)
levels_unused(x)

# Drop unused levels
levels_drop(x)

# Top 3 letters by frequency
lumped_letters <- levels_lump(x, 3)
levels_count(lumped_letters)

# To remove the "Other" category, use `levels_rm()`

levels_count(levels_rm(lumped_letters, "Other"))

# We can use levels_lump to create a generic top n function for non-factors too

get_top_n <- function(x, n){
  f <- levels_lump(factor_(x, order = FALSE), n = n)
  levels_count(f)
}

get_top_n(x, 3)

# A neat way to order the levels of a factor by frequency
# is the following:

levels(levels_lump(x, prop = 1)) # Highest to lowest
levels(levels_lump(x, prop = -1)) # Lowest to highest
