PF(formula, data, ...)
PF(x, nfolds = 5, consensus = FALSE, p = 0.01, s = 3, y = 0.5, theta = 0.7, classColumn = ncol(x), ...)
p. The filter stops after s iterations with not enough noisy instances removed (according to the proportion p).
An object of class filter, which is a list with seven components:
cleanData is a data frame containing the filtered dataset.
remIdx is a vector of integers indicating the indexes of removed instances (i.e. their row numbers with respect to the original data frame).
repIdx is a vector of integers indicating the indexes of repaired/relabelled instances (i.e. their row numbers with respect to the original data frame).
repLab is a factor containing the new labels for repaired instances.
parameters is a list containing the argument values.
call contains the original call to the filter.
extraInf is a character vector with additional information not covered by the previous items.
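The relationship between these components can be illustrated without running the filter. The sketch below hand-builds a list with the same seven component names; the object `mock`, the removed row numbers, and the `NULL` entries are hypothetical values chosen for illustration, not actual PF output. It then checks that `cleanData` is the original data frame minus the rows listed in `remIdx`.

```r
# Hypothetical stand-in for a filter result; all values are made up.
data(iris)
removed <- c(71L, 84L, 107L)   # assumed row numbers flagged as noisy

mock <- list(
  cleanData  = iris[-removed, ],
  remIdx     = removed,
  repIdx     = NULL,           # assumption: no instance was relabelled
  repLab     = NULL,
  parameters = list(nfolds = 3, s = 1),
  call       = quote(PF(Species ~ ., data = iris, s = 1, nfolds = 3)),
  extraInf   = NULL
)
class(mock) <- "filter"

# The rows kept are exactly those whose index is not in remIdx.
kept <- setdiff(seq_len(nrow(iris)), mock$remIdx)
same <- identical(mock$cleanData, iris[kept, ])
n_removed <- length(mock$remIdx)
```

Because `remIdx` refers to row numbers of the original data frame, it can be used directly to inspect the discarded instances, for example with `iris[mock$remIdx, ]`.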
nfolds partitions of data. After a 'good rules selection' process based on the accuracy of each rule, the resulting sets of good rules are tested on the whole dataset, and the removal of noisy instances is decided via consensus or majority voting schemes. Finally, a proportion of good instances (i.e. those whose label agrees with all the base classifiers) is stored and not considered in subsequent iterations. The process stops after s iterations in which not enough noisy instances (according to the proportion p) are removed.
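The voting step and the stop criterion described above can be sketched in base R. This is a simplified illustration, not the package's implementation: the matrix `votes` and the helpers `flag_noisy` and `should_stop` are hypothetical names. It also assumes that every base classifier casts a vote on every instance, and that "not enough" means fewer than `p` times the original number of instances removed in each of the last `s` iterations.

```r
# votes[i, k] is TRUE when the rule set built on partition k misclassifies
# instance i. Made-up values for 5 instances and nfolds = 3.
votes <- matrix(
  c(TRUE,  TRUE,  TRUE,
    TRUE,  TRUE,  FALSE,
    TRUE,  FALSE, FALSE,
    FALSE, FALSE, FALSE,
    FALSE, TRUE,  TRUE),
  ncol = 3, byrow = TRUE
)

# Hypothetical helper: flag instances as noisy under either voting scheme.
flag_noisy <- function(votes, consensus = FALSE) {
  n_wrong <- rowSums(votes)
  if (consensus) {
    n_wrong == ncol(votes)      # every classifier disagrees with the label
  } else {
    n_wrong > ncol(votes) / 2   # more than half of the classifiers disagree
  }
}

noisy_majority  <- which(flag_noisy(votes, consensus = FALSE))
noisy_consensus <- which(flag_noisy(votes, consensus = TRUE))

# "Good" instances: the label agrees with all base classifiers.
good <- which(rowSums(votes) == 0)

# Hypothetical stop check (assumed reading of the criterion): stop once each
# of the last s iterations removed fewer than p * n_original instances.
should_stop <- function(removed_per_iter, n_original, p = 0.01, s = 3) {
  if (length(removed_per_iter) < s) return(FALSE)
  all(utils::tail(removed_per_iter, s) < p * n_original)
}
```

Consensus voting is the more conservative choice: it removes an instance only when all rule sets disagree with its label, so it discards fewer instances than majority voting on the same votes.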
Zhu X., Wu X., Chen Q. (2006): Bridging local and global data cleansing: Identifying class noise in large, distributed data datasets. Data Mining and Knowledge Discovery, 12(2-3), 275-308.
# Next example is not run in order to save time
## Not run:
# data(iris)
# # We fix a seed since there exists a random partition for the ensemble
# set.seed(1)
# out <- PF(Species~., data = iris, s = 1, nfolds = 3)
# print(out)
# identical(out$cleanData, iris[setdiff(seq_len(nrow(iris)), out$remIdx), ])
# ## End(Not run)