testIndSpeedglm(target, dataset, xIndex, csIndex, wei = NULL, dataInfo = NULL,
univariateModels = NULL, hash = FALSE, stat_hash = NULL, pvalue_hash = NULL,
target_type = 0, robust = FALSE)
If hash = TRUE, testIndSpeedglm requires the arguments 'stat_hash' and 'pvalue_hash' for the hash-based implementation of the statistical test. These hash objects are produced or updated by each run of SES (if hash == TRUE) and they can be reused in order to speed up subsequent runs of the current statistical test. If "SESoutput" is the output of a SES run, then these objects can be retrieved by SESoutput@hashObject$stat_hash and SESoutput@hashObject$pvalue_hash.
Important: Use these arguments only with the same dataset that was used at initialization.
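A minimal sketch of this reuse is shown below. The 'max_k', 'threshold' and 'test' arguments passed to SES here are assumptions and may differ in your installed version of the package; only the retrieval via the hashObject slot and the 'stat_hash'/'pvalue_hash' arguments come from this page.
y <- rpois(30000, 10)
x <- matrix(runif(30000 * 5, 1, 50), ncol = 5)
# run SES with hashing switched on (arguments other than 'hash' are assumed)
sesObject <- SES(y, x, max_k = 2, threshold = 0.05, test = "testIndSpeedglm", hash = TRUE)
hashObj <- sesObject@hashObject
# reuse the stored statistics and p-values in a subsequent call of the test
testIndSpeedglm(y, x, xIndex = 1, csIndex = 2, hash = TRUE,
                stat_hash = hashObj$stat_hash, pvalue_hash = hashObj$pvalue_hash)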
This test is designed for data with large sample sizes, in the tens or hundreds of thousands, and it works for linear, logistic and Poisson regression. The classical lm and glm functions will use too much memory when many observations are available; the package "speedglm" handles such data more efficiently. You can try both and see: in the first case the computer will jam, whereas in the second it will not. Hence, this test is to be used in these cases only. We have not set a threshold on the sample size at which the algorithm shifts to speedglm, because this depends upon the user's computing facilities. With up to 20,000 observations the built-in function lm is faster, but when n = 30,000 speedlm is more than twice as fast.
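An illustrative sketch of this comparison (not part of the package's own examples) is given below; it assumes the 'speedglm' package is installed and simply times lm against speedglm::speedlm on a simulated data set of 30,000 observations.
n <- 30000
d <- data.frame(y = rnorm(n), x1 = runif(n), x2 = runif(n), x3 = runif(n))
# classical linear model
system.time( lm(y ~ ., data = d) )
# memory-efficient alternative from the speedglm package
system.time( speedglm::speedlm(y ~ ., data = d) )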
For all the available conditional independence tests that are currently included in the package, please see "?CondIndTests".
SES, testIndLogistic, testIndReg, testIndPois, CondIndTests
dataset <- matrix(runif(40000 * 10, 1, 50), ncol = 10 )
# the target is a Poisson-distributed vector generated independently of the dataset
target <- rpois(40000, 10)
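# compare the running time of the standard Poisson regression test with the speedglm-based test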
system.time( testIndPois(target, dataset, xIndex = 1, csIndex = 2) )
system.time( testIndSpeedglm(target, dataset, xIndex = 1, csIndex = 2) )