semiArtificial (version 2.4.1)

cleanData: Rejection of new instances based on their distance to existing instances

Description

The function provides three data cleaning methods. The first two reject instances whose distance to their nearest neighbors in the existing data is too small or too large: the first checks distances between instances regardless of class, and the second takes only instances from the same class into account. The third method reassigns the response variable using the prediction model stored in the generator teObject.

Usage

cleanData(teObject, newdat, similarDropP=NA, dissimilarDropP=NA,
          similarDropPclass=NA, dissimilarDropPclass=NA,
          nearestInstK=1, reassignResponse=FALSE, cleaningObject=NULL)

Arguments

teObject

An object of class TreeEnsemble containing a generator structure as returned by treeEnsemble. The teObject contains the generator's training instances, from which we compute a distribution of distances from instances to their nearestInstK nearest instances. This distance distribution, computed on the generator's training data, serves as the criterion for rejecting new instances from newdat: based on the parameters below, we reject instances that are too close to or too far away from their nearest neighbors in the generator's training data. The computed distance distributions are stored and returned as the cleaningObject component of the returned list. Providing that component on subsequent calls reduces the computational load.

newdat

A data.frame object with the (newly generated) data to be cleaned.

similarDropP

The numeric parameters similarDropP and dissimilarDropP (default value NA, valid range [0, 1]) remove instances in newdat that are too close to or too far away from the generator's training instances. The distance distribution is computed from the instances stored in teObject. For each instance in teObject we store the distance to its nearestInstK nearest instances (disregarding identical instances). These distances are sorted and represent a distribution of nearest distances over all training instances. The values similarDropP and dissimilarDropP represent the proportions of allowed smaller/larger distances, computed on the generator's training data contained in teObject.

dissimilarDropP

See similarDropP.

similarDropPclass

For classification problems only. Similarly to similarDropP and dissimilarDropP above, similarDropPclass and dissimilarDropPclass (also in the [0, 1] range) remove instances in newdat that are too close to or too far away from the generator's training instances, but take only near instances from the same class into account. The similarDropPclass contains either a single numeric value giving the threshold for all class values or a vector of thresholds, one for each class. If the vector is of insufficient length, it is replicated using the function rep. The generated distance distributions are stored in the cleaningObject component of the returned list.

dissimilarDropPclass

See similarDropPclass.

nearestInstK

An integer (default value 1) controlling how many of the generator's training instances we take into account when computing the distance distribution of nearest instances.

reassignResponse

A logical value controlling whether the response variable of newdat is set anew using a random forest prediction model or kept as it is. The default value reassignResponse=FALSE means that the response values are not changed.

cleaningObject

A list object with precomputed distance distributions and a predictor from previous runs of the same function. If provided, this saves computation time.
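The replication of a too-short similarDropPclass or dissimilarDropPclass vector can be illustrated with base R. This is only a sketch of how rep recycles values: the use of length.out and the variable names are assumptions for illustration, and the package's internal call may differ.

```r
# Class levels of the response; taken from iris for illustration.
classes <- levels(iris$Species)

# A single value applies to all classes; a shorter vector is recycled by rep().
# (Using length.out here is an assumption about how the recycling is done.)
single <- rep(0.05, length.out = length(classes))
short  <- rep(c(0.05, 0.10), length.out = length(classes))

names(single) <- classes
names(short)  <- classes
short
```

With three classes and two supplied thresholds, the third class receives the first threshold again, so supplying one value per class avoids surprises.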

Value

The method returns a list object with two components:

cleanData

A data.frame containing the instances from newdat that remain after rejecting instances that are too close or too distant.

cleaningObject

A list containing the computed distributions of nearest distances (also class-based for classification problems), and possibly a predictor used for reassigning the response variable.

Details

The function uses the training instances stored in the generator teObject to compute the distribution of distances from instances to their nearestInstK nearest instances. For classification problems, the distributions can also be computed using only instances from the same class. Using these near-distance distributions, the function rejects all new instances that are too close to or too far away from existing instances.

The default value of similarDropP, dissimilarDropP, similarDropPclass, and dissimilarDropPclass is NA, which means that near/far instances are not rejected. A value of 0 for similarDropP and similarDropPclass, or a value of 1 for dissimilarDropP and dissimilarDropPclass, has the same effect.
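The rejection idea described above can be sketched in base R. This is a minimal illustration and not the package's implementation: the Euclidean distance on numeric columns, the inclusive threshold boundaries, the helper nearest_dist, and the jittered stand-in for generated data are all assumptions made for the example. Only the default case nearestInstK=1 is shown.

```r
# Illustrative sketch only; the package may use a different distance measure.
train <- as.matrix(iris[, 1:4])

# Hypothetical helper: distance from each row of `query` to its nearest row in `ref`.
# With skip_identical = TRUE, zero distances (identical instances) are ignored.
nearest_dist <- function(query, ref, skip_identical = FALSE) {
  apply(query, 1, function(q) {
    d <- sqrt(colSums((t(ref) - q)^2))  # Euclidean distance to every row of ref
    if (skip_identical) d <- d[d > 0]
    min(d)
  })
}

# Distribution of nearest distances within the training data.
train_dist <- nearest_dist(train, train, skip_identical = TRUE)

# Thresholds taken as proportions of that distribution,
# in the spirit of similarDropP = 0.05 and dissimilarDropP = 0.95.
lower <- quantile(train_dist, 0.05)
upper <- quantile(train_dist, 0.95)

# Stand-in for generated data: jittered training rows (not real generator output).
set.seed(1)
new <- train + matrix(rnorm(length(train), sd = 0.2), nrow = nrow(train))
new_dist <- nearest_dist(new, train)

# Keep only instances whose nearest distance lies between the two thresholds.
keep <- new_dist >= lower & new_dist <= upper
cleaned <- new[keep, , drop = FALSE]
nrow(cleaned)
```

The class-based variant would follow the same pattern, with nearest distances computed separately within each class.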

See Also

treeEnsemble, newdata.TreeEnsemble.

Examples

# NOT RUN {
library(semiArtificial)

# inspect properties of the iris data set
plot(iris, col = iris$Species)
summary(iris)

# build a generator from the iris data
irisEnsemble <- treeEnsemble(Species ~ ., iris, noTrees = 10)

# use the generator to create new data
irisNewEns <- newdata(irisEnsemble, size = 150)

# inspect properties of the new data
plot(irisNewEns, col = irisNewEns$Species) # plot generated data
summary(irisNewEns)

# reject generated instances too close to or too far from the training instances
clObj <- cleanData(irisEnsemble, irisNewEns, similarDropP = 0.05, dissimilarDropP = 0.95,
                   similarDropPclass = 0.05, dissimilarDropPclass = 0.95,
                   nearestInstK = 1, reassignResponse = FALSE, cleaningObject = NULL)
head(clObj$cleanData)
# }
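
As described for the cleaningObject argument, the distance distributions returned by one call can be passed to a later call to save computation. The following is a hedged sketch of that pattern using only the arguments documented on this page; the variable names batch1, batch2, first, and second are illustrative, and the block is skipped when the semiArtificial package is not installed.

```r
# Requires the semiArtificial package; skipped if it is not installed.
if (requireNamespace("semiArtificial", quietly = TRUE)) {
  library(semiArtificial)

  irisEnsemble <- treeEnsemble(Species ~ ., iris, noTrees = 10)

  # First call: distance distributions are computed and returned.
  batch1 <- newdata(irisEnsemble, size = 150)
  first <- cleanData(irisEnsemble, batch1,
                     similarDropP = 0.05, dissimilarDropP = 0.95)

  # Second call: pass the stored cleaningObject instead of recomputing.
  batch2 <- newdata(irisEnsemble, size = 150)
  second <- cleanData(irisEnsemble, batch2,
                      similarDropP = 0.05, dissimilarDropP = 0.95,
                      cleaningObject = first$cleaningObject)

  nrow(second$cleanData)
}
```

The number of retained rows varies between runs because the generated data are random.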
