Conditional independence test for continuous, binary and count data with thousands of samples: Conditional independence test for continuous, binary and discrete (counts) variables with thousands of observations

Description

The main task of this test is to provide a p-value PVALUE for the null hypothesis: feature 'X' is independent from 'TARGET' given a conditioning set CS. The p-value is calculated by comparing a regression model based on the conditioning set CS against a model whose regressors are both X and CS. The comparison is performed through a chi-square test, with the appropriate degrees of freedom, on the difference between the deviances of the two models.
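The comparison described above can be sketched with base R alone. This is only an illustration of the deviance-difference test for a count target, not the internal code of testIndSpeedglm; the simulated data and the names x, cs, fit0 and fit1 are assumptions made for the example.

```r
set.seed(1)
n <- 5000
cs <- runif(n, 1, 50)     # conditioning variable (CS), simulated
x  <- runif(n, 1, 50)     # variable under test (X), simulated
target <- rpois(n, 10)    # count target, simulated independently of x and cs

# Null model: target regressed on the conditioning set only
fit0 <- glm(target ~ cs, family = poisson)
# Alternative model: target regressed on both X and CS
fit1 <- glm(target ~ cs + x, family = poisson)

# Test statistic: difference between the deviances of the two models
stat <- deviance(fit0) - deviance(fit1)
# Degrees of freedom: number of extra parameters in the larger model
dof <- df.residual(fit0) - df.residual(fit1)
# Upper-tail chi-square p-value
pvalue <- pchisq(stat, df = dof, lower.tail = FALSE)
```

Since x carries no information about the target in this simulation, a small p-value would only arise by chance.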

Usage

testIndSpeedglm(target, dataset, xIndex, csIndex, wei = NULL, dataInfo = NULL, 
univariateModels = NULL, hash = FALSE, stat_hash = NULL, pvalue_hash = NULL, 
target_type = 0, robust = FALSE)

Arguments

target

A numeric vector containing the values of the target variable. It can be continuous, percentages (values between 0 and 1), binary or discrete (counts).

dataset

A numeric matrix, or a data frame when there are categorical predictors (factors), containing the variables for performing the test. Rows are samples and columns are features.

xIndex

The index of the variable whose association with the target we want to test.

csIndex

The indices of the variables to condition on.

wei

A vector of weights to be used for weighted regression. The default value is NULL.

dataInfo

A list object with information on the structure of the data. Default value is NULL.

univariateModels

Fast alternative to the hash object for univariate tests. List with vectors "pvalues" (p-values), "stats" (statistics) and "flags" (flag = TRUE if the test was successful) representing the univariate association of each variable with the target. Default value is NULL.

hash

A boolean variable which indicates whether (TRUE) or not (FALSE) to use the hash-based implementation of the statistics of SES. Default value is FALSE. If TRUE you have to specify the stat_hash argument and the pvalue_hash argument.

stat_hash

A hash object (hash package required) which contains the cached generated statistics of a SES run in the current dataset, using the current test.

pvalue_hash

A hash object (hash package required) which contains the cached generated p-values of a SES run in the current dataset, using the current test.

target_type

A numeric vector that represents the type of the target. Default value is 0. See details for more.

target_type = 1 (binary target)
target_type = 2 (nominal target)
target_type = 3 (discrete target)

robust

A boolean variable which indicates whether (TRUE) or not (FALSE) to use a robustified version of the regression. Currently it is not available for this test.

Value

A list including:

Details

If argument target_type=0 then testIndSpeedglm requires the dataInfo argument to indicate the type of the current target:

dataInfo$target_type = "normal" (continuous target)
dataInfo$target_type = "binary" (binary target)
dataInfo$target_type = "discrete" (discrete target)

If hash = TRUE, testIndSpeedglm requires the arguments 'stat_hash' and 'pvalue_hash' for the hash-based implementation of the statistical test. These hash objects are produced or updated by each run of SES (if hash = TRUE) and they can be reused to speed up subsequent runs of the current statistical test. If "SESoutput" is the output of a SES run, then these objects can be retrieved by SESoutput@hashObject$stat_hash and SESoutput@hashObject$pvalue_hash.

Important: Use these arguments only with the same dataset that was used at initialization.

This test is designed for data with large sample sizes, in the tens or hundreds of thousands, and it works for linear, logistic and Poisson regression. The classical lm and glm functions use too much memory when many observations are available, whereas the package "speedglm" handles such data more efficiently. With very large samples, lm and glm can make the computer hang, while speedglm does not. Hence, this test is to be used in these cases only. We have not set a sample size threshold at which the algorithm switches to speedglm automatically, because the right threshold depends upon the user's computing facilities. With up to 20,000 observations the built-in function lm is faster, but with n = 30,000 speedlm is more than twice as fast.
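Because the break-even point depends on the machine, it may be worth timing both fitting functions on data of the intended size. The following is a rough sketch under assumed, simulated data (the data frame dat and its columns are made up for illustration); it calls speedlm from the speedglm package only if that package is installed, and the timings it prints will differ between machines.

```r
set.seed(1)
n <- 40000
dat <- data.frame(y = rnorm(n), x1 = runif(n, 1, 50), x2 = runif(n, 1, 50))

# Baseline: the built-in linear model
time_lm <- system.time(fit_lm <- lm(y ~ x1 + x2, data = dat))

# speedlm() from the speedglm package, used only if it is installed
if (requireNamespace("speedglm", quietly = TRUE)) {
  time_speed <- system.time(
    fit_speed <- speedglm::speedlm(y ~ x1 + x2, data = dat)
  )
  # Elapsed time in seconds for each fit
  print(c(lm = time_lm[["elapsed"]], speedlm = time_speed[["elapsed"]]))
  # The two fits should agree on the estimated coefficients
  print(all.equal(as.numeric(coef(fit_lm)), as.numeric(coef(fit_speed))))
}
```

Repeating the comparison for a few values of n gives an idea of where, on a given computer, switching to this test starts to pay off.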

For all the available conditional independence tests that are currently included in the package, please see "?CondIndTests".

References

McCullagh, Peter, and John A. Nelder. Generalized linear models. CRC press, USA, 2nd edition, 1989.

Examples

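library(MXM)
# simulate 40,000 samples of 10 continuous features
dataset <- matrix(runif(40000 * 10, 1, 50), ncol = 10)
# the target is a vector of Poisson counts, simulated independently of the dataset
target <- rpois(40000, 10)
# compare the running time of the glm-based and the speedglm-based tests
system.time( testIndPois(target, dataset, xIndex = 1, csIndex = 2) )
system.time( testIndSpeedglm(target, dataset, xIndex = 1, csIndex = 2) )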
