split_var: Split numeric variables into smaller groups

Description

Recode numeric variables into equal sized groups, i.e. a variable is cut into a smaller number of groups at specific cut points. split_var_if() is a scoped variant of split_var(), where transformation will be applied only to those variables that match the logical condition of predicate.

Usage

split_var(x, ..., n, as.num = FALSE, val.labels = NULL,
  var.label = NULL, inclusive = FALSE, append = TRUE,
  suffix = "_g")
split_var_if(x, predicate, n, as.num = FALSE, val.labels = NULL,
  var.label = NULL, inclusive = FALSE, append = TRUE,
  suffix = "_g")

Arguments

A vector or data frame.

...

Optional, unquoted names of variables that should be selected for further processing. Required, if x is a data frame (and no vector) and only selected variables from x should be processed. You may also use functions like : or tidyselect's select_helpers. See 'Examples' or package-vignette.

The new number of groups that x should be split into.

as.num

Logical, if TRUE, return value will be numeric, not a factor.

val.labels

Optional character vector, to set value label attributes of recoded variable (see vignette Labelled Data and the sjlabelled-Package). If NULL (default), no value labels will be set. Value labels can also be directly defined in the rec-syntax, see 'Details'.

var.label

Optional string, to set variable label attribute for the returned variable (see vignette Labelled Data and the sjlabelled-Package). If NULL (default), variable label attribute of x will be used (if present). If empty, variable label attributes will be removed.

inclusive

Logical; if TRUE, cut point value are included in the preceeding group. This may be necessary if cutting a vector into groups does not define proper ("equal sized") group sizes. See 'Note' and 'Examples'.

append

Logical, if TRUE (the default) and x is a data frame, x including the new variables as additional columns is returned; if FALSE, only the new variables are returned.

suffix

String value, will be appended to variable (column) names of x, if x is a data frame. If x is not a data frame, this argument will be ignored. The default value to suffix column names in a data frame depends on the function call:

recoded variables (rec()) will be suffixed with "_r"
recoded variables (recode_to()) will be suffixed with "_r0"
dichotomized variables (dicho()) will be suffixed with "_d"
grouped variables (split_var()) will be suffixed with "_g"
grouped variables (group_var()) will be suffixed with "_gr"
standardized variables (std()) will be suffixed with "_z"
centered variables (center()) will be suffixed with "_c"
de-meaned variables (de_mean()) will be suffixed with "_dm"
grouped-meaned variables (de_mean()) will be suffixed with "_gm"

If suffix = "" and append = TRUE, existing variables that have been recoded/transformed will be overwritten.

predicate

A predicate function to be applied to the columns. The variables for which predicate returns TRUE are selected.

Value

A grouped variable with equal sized groups. If x is a data frame, for append = TRUE, x including the grouped variables as new columns is returned; if append = FALSE, only the grouped variables will be returned. If append = TRUE and suffix = "", recoded variables will replace (overwrite) existing variables.

Details

split_var() splits a variable into equal sized groups, where the amount of groups depends on the groupcount-argument. Thus, this functions cuts a variable into groups at the specified quantiles.

By contrast, group_var recodes a variable into groups, where groups have the same value range (e.g., from 1-5, 6-10, 11-15 etc.).

split_var() also works on grouped data frames (see group_by). In this case, splitting is applied to the subsets of variables in x. See 'Examples'.

Examples

Run this code

# NOT RUN {
data(efc)
# non-grouped
table(efc$neg_c_7)

# split into 3 groups
table(split_var(efc$neg_c_7, n = 3))

# split multiple variables into 3 groups
split_var(efc, neg_c_7, pos_v_4, e17age, n = 3, append = FALSE)
frq(split_var(efc, neg_c_7, pos_v_4, e17age, n = 3, append = FALSE))

# original
table(efc$e42dep)

# two groups, non-inclusive cut-point
# vector split leads to unequal group sizes
table(split_var(efc$e42dep, n = 2))

# two groups, inclusive cut-point
# group sizes are equal
table(split_var(efc$e42dep, n = 2, inclusive = TRUE))

# Unlike dplyr's ntile(), split_var() never splits a value
# into two different categories, i.e. you always get a clean
# separation of original categories
library(dplyr)

x <- dplyr::ntile(efc$neg_c_7, n = 3)
table(efc$neg_c_7, x)

x <- split_var(efc$neg_c_7, n = 3)
table(efc$neg_c_7, x)

# works also with gouped data frames
mtcars %>%
  split_var(disp, n = 3, append = FALSE) %>%
  table()

mtcars %>%
  group_by(cyl) %>%
  split_var(disp, n = 3, append = FALSE) %>%
  table()

# }