testIndSpeedglm(target, dataset, xIndex, csIndex, dataInfo = NULL, univariateModels = NULL,
  hash = FALSE, stat_hash = NULL, pvalue_hash = NULL, target_type = 0, robust = FALSE)
The built-in lm and glm functions use too much memory when many observations are available; the package "speedglm" handles such data more efficiently. You can try both and see: in the first case the computer will jam, whereas in the second it will not. Hence, this test is to be used in these cases only. We have not set a sample size threshold at which the algorithm shifts to speedglm, because this depends upon the user's computing facilities. With up to 20,000 observations the built-in lm is faster, but at n = 30,000 the speedlm is more than twice as fast.
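To get a feel for the difference outside the test itself, here is a minimal sketch (not part of the package) that times the built-in lm against speedlm from the "speedglm" package on simulated continuous data; the sample size n and the number of predictors are assumptions, so adjust them to your machine:

require(speedglm)
n <- 30000  # around this size speedlm becomes noticeably faster
x <- matrix( rnorm(n * 10), ncol = 10 )
dat <- data.frame(y = rnorm(n), x)
# time the built-in, memory-hungry fit against the memory-efficient one
system.time( m1 <- lm(y ~ ., data = dat) )
system.time( m2 <- speedlm(y ~ ., data = dat) )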
For all the available conditional independence tests that are currently included in the package, please see "?CondIndTests".

See Also: SES, testIndLogistic, testIndReg, testIndPois, CondIndTests
library(MXM)
#require(gRbase)  # for faster computations in the internal functions
#simulate a dataset with categorical data
dataset <- matrix( sample(c(0, 1), 50 * 100000, replace = TRUE), ncol = 50)
#initialize categorical target
target <- dataset[, 50]
#remove target from the dataset
dataset <- dataset[, -50]
#run the conditional independence test for the nominal class variable
# check the runtimes between the two ways
system.time( results <- testIndSpeedglm(target, dataset, xIndex = 44, csIndex = c(10, 20),
target_type = 2) )
system.time( results <- testIndLogistic(target, dataset, xIndex = 44, csIndex = c(10, 20),
target_type = 2) )
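Both calls fit the same logistic model for the binary target, so apart from run time their output should essentially agree. Assuming the returned object exposes pvalue and stat elements, as MXM tests typically do, you can inspect them directly:

results$pvalue  # p-value of the conditional independence test
results$stat    # the corresponding test statistic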