FeatureLookup(data, label, ban = NULL, antiban = FALSE, type = "auto",
split = "information", folds = 5, seed = 0, verbose = TRUE,
plots = TRUE, max_depth = 4, min_split = max(20, nrow(data)/1000),
min_bucket = round(min_split/3), min_improve = 0.01,
competing_splits = 2, surrogate_search = 5, surrogate_type = 2,
surrogate_style = 0)
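The two size-dependent defaults in the signature above, min_split and min_bucket, can be checked by hand. The following is a minimal sketch that simply evaluates the same expressions for a given number of rows; default_sizes is a hypothetical helper written for illustration and is not part of the package.

```r
# Hypothetical helper (not part of the package): evaluates the default
# expressions from the usage signature for a given number of rows.
default_sizes <- function(n_rows) {
  min_split <- max(20, n_rows / 1000)   # larger of 20 and 0.1% of the rows
  min_bucket <- round(min_split / 3)    # about one third of min_split
  c(min_split = min_split, min_bucket = min_bucket)
}

small <- default_sizes(5000)     # small data: the floor of 20 applies
large <- default_sizes(1000000)  # large data: 0.1% of the rows applies
print(small)
print(large)
```

On small data the floor of 20 dominates, so the defaults only start scaling with the data once it exceeds 20,000 rows.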
ban: Defaults to NULL, which means no variables are banned (all variables can potentially be used for the decision tree).

antiban: If set to TRUE, the ban is turned into a selection (all variables not listed in ban are banned instead). Defaults to FALSE.

type: The type of model: classification ("class"), regression ("anova"), count ("poisson"), or survival ("exp"). Defaults to "auto", which attempts to find the base type of model to create (classification or regression) using simple heuristics.

split: For classification (type = "class"), the splitting rule must be set to either "gini" (Gini index) or "information" (Information Gain). Defaults to "information", as it is less biased than "gini" with respect to cardinality.

folds: [...] label is also valid.

verbose: [...] competing_splits + surrogate_search rows will be printed. Defaults to TRUE.

plots: Defaults to TRUE.

max_depth: Defaults to 4. Any value greater than 30 will cause issues on 32-bit operating systems due to the underlying C code.

min_split: Defaults to max(20, nrow(data) / 1000), which is the larger of 20 and 0.1% of the number of observations.

min_bucket: Defaults to round(min_split/3), which by default means at least 7, or approximately 0.033% of the number of observations.

min_improve: For classification, the purity (from Gini or Information Gain) must increase by at least min_improve.

competing_splits: If verbose = TRUE, each node will have competing_splits rules printed, if they are adequate enough (instead of only one splitting rule). This lets the user look up more details. Defaults to 2.

surrogate_search: [...] verbose = TRUE. Defaults to 5.

surrogate_type: If set to 0, any surrogates with missing values are not used for the tree. If set to 1, when all surrogates have missing values, they are not used for the tree. If set to 2, when no surrogate can be used, the majority rule is used (Breiman tree). Sparse frames should preferably use 2. The default, 2, is recommended as it handles missing values better. Set to 0 if you need to ignore missing values as much as possible.

surrogate_style: If set to 1, any missing values in the surrogate are removed when computing the correctness of the surrogate. If set to 0, missing values are ignored and all observations are taken into account when computing the correctness of the surrogate. Defaults to 0. Set to 1 if you need to ignore missing values as much as possible.

Returns an rpart model.

Set max_depth to a very small value (like 3). This ensures interpretability. Moreover, if you have a sparse frame (with a lot of missing values), it is important to keep an eye on surrogate_type and surrogate_style, as they dictate whether a split point will be made depending on the missing values. The default values are chosen to handle missing values appropriately. However, if your intent is to penalize missing values (for instance, if missing values are anomalies), changing them to 0 and 1 respectively is recommended.

## Not run: ------------------------------------
# # An example of a heavily regularized decision tree
# # Settings are intentionally difficult enough for a decision tree
# # This way, only great split points are reported
# FeatureLookup(data,
# label,
# ban = c("CAR", "TOBACCO"),
# antiban = FALSE,
# type = "anova",
# folds = 20,
# seed = 11111,
# verbose = TRUE,
# plots = TRUE,
# max_depth = 3,
# min_split = 1000,
# min_bucket = 200,
# min_improve = 0.10,
# competing_splits = 10,
# surrogate_search = 10,
# surrogate_type = 2,
# surrogate_style = 0)
## ---------------------------------------------
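Because the function returns an rpart model, the missing-value settings discussed above can also be explored directly with rpart. The sketch below assumes that surrogate_type and surrogate_style correspond to the usesurrogate and surrogatestyle arguments of rpart.control, and that the other tuning arguments map onto maxdepth, minsplit, minbucket, cp, maxcompete, maxsurrogate and xval. That mapping is an assumption made for illustration, not something this page confirms. It uses the built-in airquality data set, which contains missing values.

```r
library(rpart)

# airquality ships with R; Solar.R and Ozone contain missing values.
data(airquality)
train <- airquality[!is.na(airquality$Ozone), ]  # keep rows with a known label

# Assumed mapping (not confirmed by this page):
#   max_depth -> maxdepth, min_split -> minsplit, min_bucket -> minbucket,
#   min_improve -> cp, competing_splits -> maxcompete,
#   surrogate_search -> maxsurrogate, surrogate_type -> usesurrogate,
#   surrogate_style -> surrogatestyle, folds -> xval
ctrl <- rpart.control(maxdepth = 3,
                      minsplit = 20,
                      minbucket = 7,
                      cp = 0.01,
                      maxcompete = 2,
                      maxsurrogate = 5,
                      usesurrogate = 0,    # penalizing setting for surrogate_type
                      surrogatestyle = 1,  # penalizing setting for surrogate_style
                      xval = 5)

set.seed(0)  # cross-validation folds are drawn at random
fit <- rpart(Ozone ~ ., data = train, method = "anova", control = ctrl)
print(fit)
```

Refitting with usesurrogate = 2 and surrogatestyle = 0 and comparing the printed trees is one way to see how much the missing-value handling changes the chosen split points.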