binarizeCategoricalColumns: Turn categorical columns into sets of binary indicators

Description

Given a data frame with (some) categorical columns, this function replaces each such column with a set of binary indicator variables that encode its levels, either as one level vs. all others or as one level vs. another.

Usage

binarizeCategoricalColumns(
   data,
   convertColumns = NULL,
   considerColumns = NULL,
   maxOrdinalLevels = 3,
   levelOrder = NULL,
   minCount = 3,
   val1 = 0, val2 = 1,
   includePairwise = FALSE,
   includeLevelVsAll = TRUE,
   dropFirstLevelVsAll = TRUE,
   dropUninformative = TRUE,
   includePrefix = TRUE,
   prefixSep = ".",
   nameForAll = "all",
   levelSep = NULL,
   levelSep.pairwise = if (length(levelSep)==0) ".vs." else levelSep,
   levelSep.vsAll = if (length(levelSep)==0) 
              (if (nameForAll=="") "" else ".vs.") else levelSep,
   checkNames = FALSE,
   includeLevelInformation = FALSE)
binarizeCategoricalColumns.pairwise(
   data, 
   maxOrdinalLevels = 3,
   convertColumns = NULL,
   considerColumns = NULL,
   levelOrder = NULL,
   val1 = 0, val2 = 1, 
   includePrefix = TRUE,
   prefixSep = ".", 
   levelSep = ".vs.",
   checkNames = FALSE)
binarizeCategoricalColumns.forRegression(
   data, 
   maxOrdinalLevels = 3,
   convertColumns = NULL,
   considerColumns = NULL,
   levelOrder = NULL,
   val1 = 0, val2 = 1,
   includePrefix = TRUE,
   prefixSep = ".",
   checkNames = TRUE)
binarizeCategoricalColumns.forPlots(
   data, 
   maxOrdinalLevels = 3,
   convertColumns = NULL,
   considerColumns = NULL,
   levelOrder = NULL,
   val1 = 0, val2 = 1,
   includePrefix = TRUE,
   prefixSep = ".")

Arguments

data

A data frame.

convertColumns

Optional character vector giving the column names of the columns to be converted. See maxOrdinalLevels below.

considerColumns

Optional character vector giving the column names of columns that should be looked at and possibly converted. If not given, all columns will be considered. See maxOrdinalLevels below.

maxOrdinalLevels

When convertColumns above is NULL, the function looks at all columns in considerColumns and converts all non-numeric columns and those numeric columns that have at most maxOrdinalLevels unique values. A column is considered numeric if its storage mode is numeric or if it is character and all entries with the exception of "NA", "NULL" and "NO DATA" represent valid numbers.

levelOrder

Optional list giving the ordering of levels (unique values) in each of the converted columns. Best used in conjunction with convertColumns.

minCount

Levels of a converted column that occur in fewer than minCount elements will be ignored.

val1

Value for the lower level in binary comparisons.

val2

Value for the higher level in binary comparisons.

includePairwise

Logical: should pairwise binary indicators be included? For each pair of levels, the indicator is val1 for the lower level (earlier in levelOrder), val2 for the higher level and NA otherwise.

includeLevelVsAll

Logical: should binary indicators for each level vs. all others be included? The indicator is val2 where the column equals the level and val1 otherwise.

dropFirstLevelVsAll

Logical: should the column representing the first level vs. all others be dropped? Dropping it makes the resulting matrix of indicators usable in regression models.

dropUninformative

Logical: should uninformative (constant) columns be dropped?

includePrefix

Logical: should the column name of the binarized column be included in column names of the output? See details.

prefixSep

Separator of column names and level names in column names of the output. See details.

nameForAll

Character string that represents "all others" in the column names of indicators of level vs. all others.

levelSep

Separator for levels to be used in column names of the output. If NULL, pairwise and level vs. all indicators will use different level separators set by levelSep.pairwise and levelSep.vsAll.

levelSep.pairwise

Separator for levels to be used in column names for pairwise indicators in the output.

levelSep.vsAll

Separator for levels to be used in column names for level vs. all indicators in the output.

checkNames

Logical: should the names of the output be made into syntactically correct R language names?

includeLevelInformation

Logical: should information about which levels are represented by which columns be included in the attributes of the output?

Value

A data frame in which the converted columns have been replaced by sets of binarized indicators. When includeLevelInformation is TRUE, the attribute includedLevels is a table with one column per output column and two rows, giving the two levels (unique values of the converted column) represented by that output column.

Details

binarizeCategoricalColumns is the most general function; the others are convenience wrappers that preset some of the options as follows:

binarizeCategoricalColumns.pairwise returns only pairwise (level vs. level) binary indicators.

binarizeCategoricalColumns.forRegression returns only level vs. all others binary indicators, with the indicator for the first level (according to levelOrder) removed. This is essentially what model.matrix would return, except for the column representing the intercept.
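
For comparison, the following minimal sketch shows what model.matrix itself produces for a single factor column. It uses base R only and assumes the default treatment contrasts; the column names that model.matrix generates (such as aB) are its own convention and need not match the names produced by binarizeCategoricalColumns.forRegression.

```r
# Base R only: the model.matrix encoding of one factor column.
x <- data.frame(a = factor(c("A", "B", "C", "A", "B", "C")))

# With default treatment contrasts, model.matrix creates an intercept plus
# one 0/1 column for every level except the first.
mm <- model.matrix(~ a, data = x)

# Drop the intercept; what remains corresponds to the level indicators
# with the first level ("A") left out.
indicators <- mm[, colnames(mm) != "(Intercept)", drop = FALSE]
print(indicators)
```

Rows belonging to the first level are all zero in the remaining columns, which is why that level acts as the reference in a regression.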

binarizeCategoricalColumns.forPlots returns only level vs. all others binary indicators and keeps them all.

The columns to be converted are identified as follows. If considerColumns is given, columns not contained in it will not be converted, even if they are included in convertColumns.

If convertColumns is given, those columns will be converted (except any not contained in non-empty considerColumns). If convertColumns is NULL, the function converts columns that are not numeric (as reported by is.numeric) and those numeric columns that have at most maxOrdinalLevels unique non-missing values.
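
The selection rule above can be sketched in base R roughly as follows. The helper selectColumnsToConvert is hypothetical and is not part of the package; it only mirrors the rule as stated in this section (using is.numeric) and may differ from the actual implementation in edge cases, for example character columns that hold numbers.

```r
# Hypothetical helper (not a package function): choose the columns to convert.
selectColumnsToConvert <- function(data, convertColumns = NULL,
                                   considerColumns = NULL,
                                   maxOrdinalLevels = 3) {
  # Without considerColumns, every column is a candidate.
  if (is.null(considerColumns)) considerColumns <- names(data)
  # An explicit convertColumns wins, restricted to the considered columns.
  if (!is.null(convertColumns)) {
    return(intersect(convertColumns, considerColumns))
  }
  keep <- vapply(considerColumns, function(col) {
    v <- data[[col]]
    # Non-numeric columns are always converted; numeric columns only when
    # they have at most maxOrdinalLevels distinct non-missing values.
    !is.numeric(v) || length(unique(v[!is.na(v)])) <= maxOrdinalLevels
  }, logical(1))
  considerColumns[keep]
}

x <- data.frame(a = c("A", "B", "C", "A"),
                b = c(1, 2, 3, 1),
                c = c(0.1, 0.7, 1.3, 2.9))
auto <- selectColumnsToConvert(x)
explicit <- selectColumnsToConvert(x, convertColumns = c("a", "c"),
                                   considerColumns = "a")
print(auto)
print(explicit)
```

In this toy data frame, column c has four distinct values and is therefore left alone in the automatic case, while the explicit request for c is overridden because c is not in considerColumns.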

The function creates two types of indicators. The first is one level (unique value) of a column vs. all others, i.e., for a given level, the indicator is val2 (usually 1) for all elements of the column that equal the level, and val1 (usually 0) otherwise. Column names for these indicators are the concatenation of the column name and prefixSep (when includePrefix is TRUE), the level, levelSep.vsAll and nameForAll. The level vs. all indicators are created for all levels that have at least minCount samples, are present in levelOrder (if it is non-NULL) and are not included in ignore (an argument of binarizeCategoricalVariable).
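
As an illustration of the first indicator type, here is a minimal base R sketch for a single vector. The helper levelVsAll is hypothetical and is not the package's implementation; its default separators merely copy the defaults listed under Usage, and the exact names the package produces are an assumption.

```r
# Hypothetical helper (not a package function): level vs. all-others indicators.
levelVsAll <- function(x, prefix = "x", minCount = 3, val1 = 0, val2 = 1,
                       prefixSep = ".", levelSep = ".vs.", nameForAll = "all") {
  counts <- table(x)
  # Keep only levels that occur at least minCount times.
  levels <- names(counts)[counts >= minCount]
  # One indicator per retained level: val2 where x equals the level, val1 elsewhere.
  out <- lapply(levels, function(lev) ifelse(x == lev, val2, val1))
  # Assumed naming scheme: prefix, prefixSep, level, levelSep, nameForAll.
  names(out) <- paste0(prefix, prefixSep, levels, levelSep, nameForAll)
  data.frame(out, check.names = FALSE)
}

x <- c("A", "A", "A", "B", "B", "B", "C")
ind <- levelVsAll(x, prefix = "a")
print(ind)
```

Level C occurs only once, fewer than minCount = 3, so it gets no indicator of its own; its element is still coded as val1 in the indicators for A and B.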

The second type of indicator encodes binary comparisons. For each pair of levels (both with at least minCount samples), the indicator is val2 (usually 1) for the higher level, val1 (usually 0) for the lower level, and NA for elements that belong to neither level. The level order is given by levelOrder (which defaults to the sorted levels of the column), assumed to be sorted in increasing order. All levels with at least minCount samples that are included in levelOrder and not included in ignore (an argument of binarizeCategoricalVariable) are used.
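
The pairwise encoding can be sketched in base R along the same lines. The helper pairwiseIndicators is hypothetical and is not the package's implementation; in particular, the "higher.vs.lower" naming used here is an assumption made for illustration.

```r
# Hypothetical helper (not a package function): pairwise level indicators.
pairwiseIndicators <- function(x, levelOrder = sort(unique(x)), minCount = 3,
                               val1 = 0, val2 = 1) {
  counts <- table(x)
  # Retain levels in levelOrder that occur at least minCount times.
  levels <- levelOrder[levelOrder %in% names(counts)[counts >= minCount]]
  if (length(levels) < 2) return(data.frame())
  # All pairs (lower, higher) in the order given by levelOrder.
  pairs <- combn(levels, 2, simplify = FALSE)
  out <- lapply(pairs, function(p) {
    ind <- rep(NA_real_, length(x))   # NA for elements in neither level
    ind[which(x == p[1])] <- val1     # lower level
    ind[which(x == p[2])] <- val2     # higher level
    ind
  })
  # Assumed naming scheme: higher level, ".vs.", lower level.
  names(out) <- vapply(pairs, function(p) paste0(p[2], ".vs.", p[1]),
                       character(1))
  data.frame(out, check.names = FALSE)
}

x <- c("A", "A", "A", "B", "B", "B", "C", "C", "C")
pw <- pairwiseIndicators(x)
print(pw)
```

With three retained levels there are three pairwise columns, and each column is NA for the elements of the level that does not take part in that comparison.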

Internally, the function calls binarizeCategoricalVariable for each column that is converted.

Examples


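# NOT RUN {
library(WGCNA)

# Toy data: one character column and one numeric column with 3 distinct values
set.seed(2);
x = data.frame(a = sample(c("A", "B", "C"), 15, replace = TRUE),
               b = sample(c(1:3), 15, replace = TRUE));
# Request both pairwise and level vs. all indicators, plus level information
out = binarizeCategoricalColumns(x, includePairwise = TRUE, includeLevelVsAll = TRUE,
                     includeLevelInformation = TRUE);
# Show the original columns next to the indicators derived from them
data.frame(x, out);
# Which two levels each output column represents
attr(out, "includedLevels")

# }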
