tabl: Construct Value Label-Friendly Frequency Tables

Description

tabl calculates raw or weighted frequency counts (or proportions) over arbitrary categorical values (including integer values), which may be expressed in terms of raw variable values or labelr label values.

Usage

tabl(
  data,
  vars = NULL,
  labs.on = TRUE,
  qtiles = 4,
  prop.digits = NULL,
  wt = NULL,
  div.by = NULL,
  max.unique.vals = 10,
  sort.freq = TRUE,
  zero.rm = FALSE,
  irreg.rm = FALSE,
  wide.col = NULL
)

Value

a data.frame.

Arguments

data: a data.frame.
vars: a character vector of quoted names of the variables that define the category groups to tabulate over. If NULL, tabl will attempt to construct a table over all combinations of all non-decimal-having variables in the data.frame that do not exceed the max.unique.vals threshold. Additionally, note the effects of the qtiles argument.
labs.on: if TRUE (the default), then value labels -- rather than the raw variable values -- will be displayed in the returned table for any value-labeled variables. Variables need not be value-labeled: This command (with this option set to TRUE or FALSE) will work even when no variables are value-labeled.
qtiles: if not NULL, must be a 1L integer between 2 and 100 indicating the number of quantile categories to employ in temporarily (for purposes of tabulation) auto-value-labeling numeric columns that exceed the max.unique.vals threshold. If NULL, no such auto-value-labeling will take place. Note: When labs.on = TRUE, any pre-existing variable value labels will be used in favor of the quantile value labels generated by this argument. By default, qtiles = 4, and the automatically generated quantile category levels will be labeled as "q025" (i.e., first quartile), "q050", "q075", and "q100".
prop.digits: if non-NULL, cell proportions will be returned instead of frequency counts, and these will be rounded to the number of decimal places specified (e.g., prop.digits = 3 means a value of 0.157 would be returned for a cell that accounted for 8 observations if the total number of observations were 51). If NULL (the default), frequency counts will be returned.
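wt: an optional weight variable that holds cell counts or some other idiosyncratic "importance" weight; in the examples below it is supplied as the quoted name of a column of data (e.g., wt = "freqwt"). If NULL, no weighting will be employed.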
div.by: Divide the returned counts by a constant for scaling purposes. This may be a number (e.g., div.by = 10 to divide by 10) or a character that follows the convention "number followed by 'K', 'M', or 'B'", where, e.g., "10K" is translated as 10000, "1B" is translated as 1000000000, etc.
max.unique.vals: Integer to specify the maximum number of unique values of a variable that may be observed for that variable to be included in tabulations. Note that labelr sets a hard ceiling of 5000 on the total number of unique value labels that any variable is permitted to have under any circumstance, as labelr is primarily intended for interactive use with moderately-sized data.frames. See the qtiles argument for an approach to incorporating many-valued numeric variables that exceed the max.unique.vals threshold.
sort.freq: By default, returned table rows are sorted in descending order of cell frequency (most frequent categories/combinations first). If set to FALSE, table rows will be sorted by the distinct values of the vars (in the order vars are specified).
zero.rm: If TRUE, zero-frequency vars categories/combinations (i.e., those not observed in the data.frame) will be filtered from the table. For tables that would produce more than 10000 rows, this is done automatically.
irreg.rm: If TRUE, tabulations exclude cases where any applicable variable (see vars argument) features any of the following "irregular" values: NA, NaN, Inf, -Inf, or any non-case-sensitive variation on "NA", "NAN", "INF", or "-INF". If FALSE, all "irregular" values (as just defined) are assigned to a "catch-all" category of NA that is featured in the returned table (if/where present).
wide.col: If non-NULL, this is the quoted name of a single column / var of supplied data.frame whose distinct values (category levels) you wish to be columns of the returned table. For example, if you are interested in a cross-tab of "edu" (highest level of education) and "race" (a race/ethnicity variable), you could supply vars= c("edu") and wide.col = "race", and the different racial-ethnic group categories would appear as distinct columns, with "edu" category levels appearing as distinct rows, and cell values representing the cross-tabbed cell "edu" level frequencies for the respective "race" groups (see examples). You may supply one wide.col at most.
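
Two of the conventions described above can be sketched in base R. The helpers parse_div_by() and is_irregular() below are hypothetical names introduced only for illustration; they are not labelr functions and are not claimed to match tabl's internals. The first translates a div.by-style value using the "number followed by 'K', 'M', or 'B'" convention (this sketch assumes uppercase suffixes only), and the second flags the "irregular" values listed under irreg.rm. The last line reproduces the prop.digits arithmetic from the argument description.

```r
# Hypothetical helper (not part of labelr): turn a div.by-style value
# into a plain number. Numbers pass through unchanged; strings must be
# a number followed by an uppercase "K", "M", or "B" (an assumption here).
parse_div_by <- function(x) {
  if (is.numeric(x)) {
    return(x)
  }
  multipliers <- c(K = 1e3, M = 1e6, B = 1e9)
  suffix <- substr(x, nchar(x), nchar(x))
  if (!suffix %in% names(multipliers)) {
    stop("expected a number followed by 'K', 'M', or 'B'")
  }
  number <- as.numeric(substr(x, 1, nchar(x) - 1))
  number * multipliers[[suffix]]
}

# Hypothetical helper (not part of labelr): flag NA, NaN, Inf, -Inf, and
# any upper/lower-case spelling of "NA", "NAN", "INF", or "-INF".
is_irregular <- function(x) {
  is.na(x) | toupper(as.character(x)) %in% c("NA", "NAN", "INF", "-INF")
}

parse_div_by("10K") # ten thousand
parse_div_by("1B") # one billion
parse_div_by(10) # numbers are returned as-is

is_irregular(c("4", "nan", "Inf", NA, "6")) # FALSE TRUE TRUE TRUE FALSE

# prop.digits = 3: a cell with 8 of 51 observations
round(8 / 51, 3) # 0.157
```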

Details

This function creates a labelr-friendly data.frame representation of multi-variable tabular data, where either value labels or values can be displayed (use of value labels is the default), and where various convenience options are provided, such as using frequency weights, using proportions instead of counts, rounding those proportions, automatically expressing many-valued, non-value-labeled numerical variables in terms of quantile category groups, or pivoting / casting one of the categorical variables' levels (labels) to serve as columns in a cross-tab-like table.
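
To make the quantile category groups concrete, here is a minimal base R sketch of binning a many-valued numeric variable into four groups labeled "q025" through "q100", as described under the qtiles argument. It is an analogue built on quantile() and cut(), not tabl's own implementation; how tabl computes its break points internally is not documented here, so the exact group boundaries may differ.

```r
# Base R analogue of quantile binning (an illustration, not tabl's internals)
x <- mtcars$mpg # a numeric column with more than 10 distinct values
qtiles <- 4 # number of quantile groups, as in tabl's default

# cumulative probabilities 0, 0.25, 0.50, 0.75, 1 and the matching cut points
probs <- seq(0, 1, length.out = qtiles + 1)
breaks <- quantile(x, probs = probs, na.rm = TRUE)

# labels follow the "q" + zero-padded percentile convention
labs <- sprintf("q%03d", round(100 * probs[-1]))

# include.lowest = TRUE keeps the minimum value in the first group;
# cut() will stop with an error if the break points are not unique
q_cat <- cut(x, breaks = breaks, labels = labs, include.lowest = TRUE)

levels(q_cat) # "q025" "q050" "q075" "q100"
table(q_cat) # frequency count per quantile group
```

Once a variable is expressed this way, it has only as many distinct values as there are quantile groups, so it can be tabulated alongside ordinary categorical variables.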

Examples

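# load labelr, which provides tabl(), add_val_labs(), add_val1(), and get_val_labs()
library(labelr)

# assign mtcars to new data.frame df
df <- mtcars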

# add NA values to make things interesting
df[1, 1:11] <- NA
rownames(df)[1] <- "Missing Car"

# add value labels
df <- add_val_labs(
  data = df,
  vars = "am",
  vals = c(0, 1),
  labs = c("automatic", "manual")
)

df <- add_val_labs(
  data = df,
  vars = "carb",
  vals = c(1, 2, 3, 4, 6, 8),
  labs = c(
    "1-carb", "2-carbs",
    "3-carbs", "4-carbs",
    "6-carbs", "8-carbs"
  )
)

# var arg can be unquoted if using add_val1()
# note that this is not add_val_labs(); add_val1() has "var" arg instead of "vars"
df <- add_val1(
  data = df,
  var = cyl, # note, "var," not "vars" arg
  vals = c(4, 6, 8),
  labs = c(
    "four-cyl",
    "six-cyl",
    "eight-cyl"
  )
)

df <- add_val_labs(
  data = df,
  vars = "gear",
  vals = 3:5,
  labs = c(
    "3-speed",
    "4-speed",
    "5-speed"
  )
)


# lookup mapping
get_val_labs(df)

# introduce other "irregular" values
df$am[1] <- NA

df[2, "am"] <- NaN
df[3, "am"] <- -Inf
df[5, "cyl"] <- "NAN"

# take a look
head(df)

# demonstrate tabl() frequency tabulation function

# this is the "first call" that will be referenced repeatedly below
# labels on, sort by variable values, suppress/exclude NA/irregular values
# ...return counts
tabl(df,
  vars = c("cyl", "am"),
  labs.on = TRUE, # use variable value labels
  sort.freq = FALSE, # sort by vars values (not frequencies)
  irreg.rm = TRUE, # NAs and the like are suppressed
  prop.digits = NULL
) # return counts, not proportions

# same as "first call", except now value labels are off
tabl(df,
  vars = c("cyl", "am"),
  labs.on = FALSE, # use variable values
  sort.freq = FALSE, # sort by vars values (not frequencies)
  irreg.rm = TRUE, # NAs and the like are suppressed
  prop.digits = NULL
) # return counts, not proportions

# same as "first call," except now proportions instead of counts
tabl(df,
  vars = c("cyl", "am"),
  labs.on = TRUE, # use variable value labels
  sort.freq = FALSE, # sort by vars values (not frequencies)
  irreg.rm = TRUE, # NAs and the like are suppressed
  prop.digits = 3
) # return proportions, rounded to 3rd decimal

# same as "first call," except now sort by frequency counts
tabl(df,
  vars = c("cyl", "am"),
  labs.on = TRUE, # use variable value labels
  sort.freq = TRUE, # sort in order of descending frequency
  irreg.rm = TRUE, # NAs and the like are suppressed
  prop.digits = NULL
) # return counts, not proportions

# same as "first call," except now use weights
set.seed(2944) # for reproducibility
df$freqwt <- sample(10:50, nrow(df), replace = TRUE) # create (fake) freq wts
tabl(df,
  vars = c("cyl", "am"),
  wt = "freqwt", # use frequency weights
  labs.on = TRUE, # use variable value labels
  sort.freq = FALSE, # sort by vars values (not frequencies)
  irreg.rm = FALSE, # NAs and the like are included/shown
  prop.digits = NULL
) # return counts, not proportions

df$freqwt <- NULL # we don't need this anymore

# now, with extremely large weights to illustrate div.by
set.seed(428441) # for reproducibility
df$freqwt <- sample(1000000:10000000, nrow(df), replace = TRUE) # large freq wts
tabl(df,
  vars = c("cyl", "am"),
  wt = "freqwt", # use frequency weights
  labs.on = TRUE, # use variable value labels
  sort.freq = FALSE, # sort by vars values (not frequencies)
  irreg.rm = FALSE, # NAs and the like are included/shown
  prop.digits = NULL
) # return counts, not proportions

# show div.by - millions
tabl(df,
  vars = c("cyl", "am"),
  wt = "freqwt", # use frequency weights
  labs.on = TRUE, # use variable value labels
  sort.freq = FALSE, # sort by vars values (not frequencies)
  irreg.rm = FALSE, # NAs and the like are included/shown
  prop.digits = NULL, # return counts, not proportions
  div.by = "1M"
) # one million

# show div.by - tens of millions
tabl(df,
  vars = c("cyl", "am"),
  wt = "freqwt", # use frequency weights
  labs.on = TRUE, # use variable value labels
  sort.freq = FALSE, # sort by vars values (not frequencies)
  irreg.rm = FALSE, # NAs and the like are included/shown
  prop.digits = NULL, # return counts, not proportions
  div.by = "10M"
) # ten million

# show div.by - 10000
tabl(df,
  vars = c("cyl", "am"),
  wt = "freqwt", # use frequency weights
  labs.on = TRUE, # use variable value labels
  sort.freq = FALSE, # sort by vars values (not frequencies)
  irreg.rm = FALSE, # NAs and the like are included/shown
  prop.digits = NULL, # return counts, not proportions
  div.by = 10000
) # ten thousand; could've used div.by = "10K"

# show div.by - 10000, but different syntax
tabl(df,
  vars = c("cyl", "am"),
  wt = "freqwt", # use frequency weights
  labs.on = TRUE, # use variable value labels
  sort.freq = FALSE, # sort by vars values (not frequencies)
  irreg.rm = FALSE, # NAs and the like are included/shown
  prop.digits = NULL, # return counts, not proportions
  div.by = "10K"
) # ten thousand; could've used div.by = 10000

df$freqwt <- NULL # we don't need this anymore

# turn labels off, to make this more compact
# do not show zero values (zero.rm)
# do not show NA values (irreg.rm)
# many-valued numeric variables will be converted to quantile categories by
# ...qtiles argument
tabl(df,
  vars = c("am", "gear", "carb", "mpg"),
  qtiles = 4, # many-valued numerics converted to quantile
  labs.on = FALSE, # use values, not variable value labels
  sort.freq = FALSE, # sort by vars values (not frequencies)
  irreg.rm = TRUE, # NAs and the like are suppressed
  zero.rm = TRUE, # variable combinations that never occur are suppressed
  prop.digits = NULL, # return counts, not proportions
  max.unique.vals = 10
) # vars with >10 distinct values are binned by qtiles (if numeric) or dropped

# same vars as above, but include NA/irregular category values and sort by frequency;
# zero.rm is FALSE: include unobserved (zero-count) category combinations
tabl(df,
  vars = c("am", "gear", "carb", "mpg"),
  qtiles = 4,
  labs.on = FALSE, # use values, not variable value labels
  sort.freq = TRUE, # sort by frequency
  irreg.rm = FALSE, # preserve/include NAs and irregular values
  zero.rm = FALSE, # include non-observed combinations
  prop.digits = NULL, # return counts, not proportions
  max.unique.vals = 10
) # vars with >10 distinct values are binned by qtiles (if numeric) or dropped

# show cross-tab view with wide.col arg
tabl(df,
  vars = c("cyl", "am"),
  labs.on = TRUE, # use variable value labels
  sort.freq = TRUE, # sort in order of descending frequency
  irreg.rm = TRUE, # NAs and the like are suppressed
  prop.digits = NULL, # return counts, not proportions
  wide.col = "am"
) # use "am" as a column variable in a cross-tab view

tabl(df,
  vars = c("cyl", "am"),
  labs.on = TRUE, # use variable value labels
  sort.freq = TRUE, # sort in order of descending frequency
  irreg.rm = TRUE, # NAs and the like are suppressed
  prop.digits = NULL, # return counts, not proportions
  wide.col = "cyl"
) # use "cyl" as a column variable in a cross-tab view

# verify select counts using base::subset()
nrow(subset(df, am == 0 & cyl == 4))
nrow(subset(df, am == 0 & cyl == 8))
nrow(subset(df, am == 1 & cyl == 8))
nrow(subset(df, am == 0 & cyl == 6))
nrow(subset(df, am == 1 & cyl == 6))

# will work on an un-labeled data.frame
tabl(mtcars, vars = c("am", "gear", "carb", "mpg"))
