dataSimilarity: Evaluate statistical similarity of two data sets

Description

Compares two data sets attribute by attribute using the mean, standard deviation, skewness, kurtosis, Hellinger distance, and the Kolmogorov-Smirnov (KS) test.

Usage

dataSimilarity(data1, data2, dropDiscrete=NA)

Arguments

data1

A data.frame containing the reference data.

data2

A data.frame with the same number and names of columns as data1.

dropDiscrete

A vector of indices of discrete attributes to skip in the comparison. Typically the class attribute is skipped, because its distribution was forced by the user.

Value

The method returns a list of statistics computed on both data sets:

equalInstances

The number of instances in data2 equal to the instances in data1.

stats1num

A matrix with rows containing statistics (mean, standard deviation, skewness, and kurtosis) computed on numeric attributes of data1.

stats2num

A matrix with rows containing statistics (mean, standard deviation, skewness, and kurtosis) computed on numeric attributes of data2.

ksP

A vector with p-values of Kolmogorov-Smirnov two sample tests, performed on matching attributes from data1 and data2.

freq1

A list with value frequencies for discrete attributes in data1.

freq2

A list with value frequencies for discrete attributes in data2.

dfreq

A list with differences in frequencies of discrete attributes' values between data1 and data2.

dstatsNorm

A matrix with rows containing the differences between statistics (mean, standard deviation, skewness, and kurtosis) computed on [0,1] normalized numeric attributes of data1 and data2.

hellingerDist

A vector with Hellinger distances between matching attributes from data1 and data2.
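
The following base R sketch illustrates the kind of quantities stored in freq1, freq2, dfreq and dstatsNorm. It is not the package's internal code. The names dfreqSketch, dstatsSketch and rescale01 are hypothetical, and the use of relative frequencies, the direction of the differences (data1 minus data2), and rescaling with the range of the reference data are assumptions made for illustration.

```r
# Illustrative sketch in base R; not the package's internal code.
set.seed(1)
half <- sample(nrow(iris), nrow(iris) / 2)
d1 <- iris[half, ]    # plays the role of data1
d2 <- iris[-half, ]   # plays the role of data2

# Discrete attribute: value frequencies and their difference
# (assumption: relative frequencies, data1 minus data2).
f1 <- prop.table(table(d1$Species))
f2 <- prop.table(table(d2$Species))
dfreqSketch <- f1 - f2

# Numeric attribute: rescale to [0,1], then compare statistics.
# Assumption: both data sets are rescaled with the range of d1.
rescale01 <- function(x, ref) (x - min(ref)) / (max(ref) - min(ref))
x1 <- rescale01(d1$Sepal.Length, d1$Sepal.Length)
x2 <- rescale01(d2$Sepal.Length, d1$Sepal.Length)
dstatsSketch <- c(mean = mean(x1) - mean(x2), sd = sd(x1) - sd(x2))

print(dfreqSketch)
print(dstatsSketch)
```

Values of dfreqSketch and dstatsSketch close to zero indicate that the two samples are similar on that attribute.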

Details

The function compares data1 with data2 on a per-attribute basis by computing several statistics: mean, standard deviation, skewness, kurtosis, Hellinger distance, and the KS test.
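
As a rough illustration of the per-attribute comparison, the base R sketch below computes the same kinds of statistics for a single numeric attribute. It is not the package's implementation. The moment-based definitions of skewness and kurtosis, the histogram binning used for the Hellinger distance, and all names (statsSketch, ksPSketch, hellingerSketch) are assumptions chosen for this example.

```r
# Illustrative sketch in base R; not the package's internal code.
x <- iris$Sepal.Length[iris$Species == "versicolor"]
y <- iris$Sepal.Length[iris$Species == "virginica"]

# Moment-based skewness and kurtosis (assumed definitions).
skewness <- function(v) mean((v - mean(v))^3) / mean((v - mean(v))^2)^1.5
kurtosis <- function(v) mean((v - mean(v))^4) / mean((v - mean(v))^2)^2

# One column per sample, one row per statistic.
statsSketch <- rbind(
  mean     = c(mean(x), mean(y)),
  sd       = c(sd(x), sd(y)),
  skewness = c(skewness(x), skewness(y)),
  kurtosis = c(kurtosis(x), kurtosis(y))
)
colnames(statsSketch) <- c("x", "y")

# p-value of the two-sample Kolmogorov-Smirnov test
# (ties in the data trigger a warning, suppressed here).
ksPSketch <- suppressWarnings(ks.test(x, y)$p.value)

# Hellinger distance between histograms over shared bins
# (assumption: bins chosen by hist() on the pooled values).
breaks <- hist(c(x, y), breaks = 10, plot = FALSE)$breaks
p <- hist(x, breaks = breaks, plot = FALSE)$counts / length(x)
q <- hist(y, breaks = breaks, plot = FALSE)$counts / length(y)
hellingerSketch <- sqrt(sum((sqrt(p) - sqrt(q))^2) / 2)

print(statsSketch)
print(ksPSketch)
print(hellingerSketch)
```

With this definition the Hellinger distance lies between 0 (identical binned distributions) and 1 (no overlap), and a small KS p-value suggests that the two samples come from different distributions.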

Examples

# NOT RUN {
# load the package that provides rbfDataGen, newdata and dataSimilarity
library(semiArtificial)

# use iris data set, split into training and testing data
set.seed(12345)
train <- sample(1:nrow(iris),size=nrow(iris)*0.5)
irisTrain <- iris[train,]
irisTest <- iris[-train,]

# create RBF generator
irisGenerator <- rbfDataGen(Species ~ ., irisTrain)

# use the generator to create new data
irisNew <- newdata(irisGenerator, size=100)

# compare statistics of original and new data
dataSimilarity(irisTest, irisNew)

# }