## S3 method for class 'formula'
ORBoostFilter(formula, data, ...)
## Default S3 method:
ORBoostFilter(x, N = 20, d = 11, Naux = max(20, N), useDecisionStump = FALSE, classColumn = ncol(x), ...)
d
If set to NULL, the optimal threshold is chosen according to the procedure described in Karmaker & Kwek. However, this can be very time-consuming, and in most cases it has little effect on the final result.
useDecisionStump
If TRUE, a decision stump is used as the weak classifier. Otherwise (default), naive Bayes is applied. Note that decision stumps are not appropriate for multi-class problems.
An object of class filter, which is a list with seven components:
cleanData
is a data frame containing the filtered dataset.
remIdx
is a vector of integers indicating the indexes of removed instances (i.e. their row number with respect to the original data frame).
repIdx
is a vector of integers indicating the indexes of repaired/relabelled instances (i.e. their row number with respect to the original data frame).
repLab
is a factor containing the new labels for repaired instances.
parameters
is a list containing the argument values.
call
contains the original call to the filter.
extraInf
is a character string with additional information not covered by the previous items.
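To make the relationship between cleanData and remIdx concrete, the following sketch builds a hand-made list that mimics part of the structure described above. The list `mock` and the row numbers in `rem` are hypothetical and chosen only for illustration; nothing here is produced by the filter itself.

```r
# Hypothetical stand-in for part of a filter result; values are made up.
data(iris)
rem <- c(71L, 84L, 107L)                       # pretend these rows were removed

mock <- list(
  cleanData = iris[setdiff(seq_len(nrow(iris)), rem), ],  # rows that survive
  remIdx    = rem,                                        # row numbers in the original data
  repIdx    = integer(0),                                 # no relabelled rows in this mock
  repLab    = factor(character(0), levels = levels(iris$Species))
)

# remIdx indexes the ORIGINAL data frame, so the removed rows can be recovered:
removed_rows <- iris[mock$remIdx, ]

# Kept rows plus removed rows account for every original row.
all_accounted <- nrow(mock$cleanData) + length(mock$remIdx) == nrow(iris)
```

Because remIdx refers to row numbers of the original data frame, it can be used directly to inspect which instances were discarded.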
The ORBoostFilter method is described in full in Karmaker & Kwek.
In general terms, a weak classifier is built in each iteration, and misclassified instances have their weight increased for the next round. Instances are removed when their weight exceeds the threshold d, i.e. they have been misclassified in consecutive rounds.
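The general idea can be sketched in base R as follows. This is an illustration under stated assumptions, not the package's implementation: the helpers `fit_stump` and `predict_stump` are hypothetical, the weight update is the standard AdaBoost one, and a count of consecutive misclassifications (`streak`) is used as a simple stand-in for the weight threshold d. The exact update and threshold rule are those given in Karmaker & Kwek.

```r
# Illustrative sketch only; helper names and the removal rule are assumptions.

# Hypothetical helper: weighted decision stump for labels in {-1, 1}.
fit_stump <- function(x, y, w) {
  best <- list(err = Inf)
  for (j in seq_len(ncol(x))) {
    for (thr in unique(x[, j])) {
      for (s in c(1, -1)) {
        pred <- ifelse(s * (x[, j] - thr) > 0, 1, -1)
        err <- sum(w[pred != y])               # weighted training error
        if (err < best$err) best <- list(err = err, j = j, thr = thr, s = s)
      }
    }
  }
  best
}

predict_stump <- function(st, x) ifelse(st$s * (x[, st$j] - st$thr) > 0, 1, -1)

# Two-class subset of iris, since stumps are used here as the weak classifier.
two <- iris[iris$Species != "setosa", ]
x <- as.matrix(two[, 1:4])
y <- ifelse(two$Species == "versicolor", 1, -1)
n <- nrow(x)

N <- 10                       # boosting iterations
d <- 3                        # here: consecutive misses tolerated (stand-in for the weight threshold)
w <- rep(1 / n, n)            # instance weights
streak <- integer(n)          # consecutive misclassification counter
removed <- logical(n)

for (iter in seq_len(N)) {
  keep <- !removed
  st <- fit_stump(x[keep, , drop = FALSE], y[keep], w[keep])
  miss <- predict_stump(st, x) != y
  err <- sum(w[keep & miss])
  if (err <= 0 || err >= 0.5) break            # weak learner no longer useful
  alpha <- 0.5 * log((1 - err) / err)
  w <- w * exp(ifelse(miss, alpha, -alpha))    # raise weight of misclassified instances
  streak <- ifelse(miss, streak + 1L, 0L)
  removed <- removed | streak >= d             # drop persistently misclassified instances
  w[removed] <- 0
  w <- w / sum(w)                              # renormalise over remaining instances
}

remIdx <- unname(which(removed))               # row numbers within `two`
```

Instances that keep being misclassified accumulate weight round after round, which is why a run of consecutive misses is a reasonable proxy here for a weight exceeding the threshold.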
Freund Y., Schapire R. E. (1997): A decision-theoretic generalization of on-line learning and an application to boosting. Journal of computer and system sciences, 55(1), 119-139.
# Next example is not run in order to save time
## Not run:
# data(iris)
# out <- ORBoostFilter(Species~., data = iris, N = 10)
# summary(out)
# identical(out$cleanData, iris[setdiff(1:nrow(iris),out$remIdx),])
# ## End(Not run)