Performs the atom-based distribution-free test for independence of two univariate random variables, which is computationally efficient for large data sets (recommended for sample sizes greater than 100).
Fast.independence.test(X,Y,NullTable=NULL,mmin=2,
mmax=min(10,length(X)), variant='ADP-EQP-ML',nr.atoms = min(40,length(X)),
combining.type='MinP',score.type='LikelihoodRatio',nr.perm=200,
compress=T, compress.p0=0.001, compress.p=0.99, compress.p1=0.000001)
Returns a UnivariateStatistic class object, with the following entries:
The test statistic when the combining type is "MinP".
The p-value when the combining type is "MinP".
The partition size m for which the p-value was the smallest.
The test statistic when the combining type is "Fisher".
The p-value when the combining type is "Fisher".
The statistic for each m in the range mmin to mmax.
The p-values for each m in the range mmin to mmax.
The null table object. Null if NullTable is non-null.
"Independence-Combined"
a character string specifying the partition type used in the test, one of "ADP" or "DDP".
"sum" or the aggregation type used by NullTable
a character string specifying the score type used in the test, one of "LikelihoodRatio" or "Pearson".
The maximum partition size of the ranked observations used for MinP or Fisher test statistic.
The minimum partition size of the ranked observations used for MinP or Fisher test statistic.
The minimum number of observations in a partition, only relevant for type="Independence", aggregation.type="Sum" and score.type="Pearson".
The minimum number of observations in a partition, only relevant for type="Independence", aggregation.type="Max" and score.type="Pearson".
The input nr.atoms.
a numeric vector with observed X values.
a numeric vector with observed Y values.
The null table of the statistic, which can be downloaded from the software website or computed by the function Fast.independence.test.nulltable.
The minimum partition size of the ranked observations, default value is 2. Ignored if NullTable is non-null.
The maximum partition size of the ranked observations, default value is the minimum between 10 and the data size.
a character string specifying the partition type, must be one of "ADP-EQP" or "ADP-EQP-ML" (default). Ignored if NullTable is non-null.
the number of atoms (i.e., possible split points in the data). Ignored if NullTable is non-null. The default value is the minimum between \(n\) and \(40\).
a character string specifying the combining type, must be one of "MinP" (default), "Fisher", or "both".
a character string specifying the score type, must be one of "LikelihoodRatio" (default), "Pearson", or "both". Ignored if NullTable is non-null.
The number of permutations for the null distribution. Ignored if NullTable is non-null.
a logical variable indicating whether you want to compress the null tables. If TRUE, null tables are compressed: The lower compress.p part of the null statistics is kept at a compress.p0 resolution, while the upper part is kept at a compress.p1 resolution (which is finer).
Parameter for compression. This is the resolution for the lower compress.p part of the null distribution.
Parameter for compression. Part of the null distribution to compress.
Parameter for compression. This is the resolution for the upper value of the null distribution.
Barak Brill
This function is a smart wrapper for the hhg.univariate.ind.combined.test function, with parameters optimized for a large number of observations.
The function first calls hhg.univariate.ind.stat to compute the vector of test statistics. Test statistics are the sum of log-likelihood scores over All Derived Partitions (ADP) of the data (computed as explained in Heller et al. (2014)).
For the 'ADP-EQP-ML' variant, the base test statistics are:
\(S_{2X2}, S_{2X3}, S_{3X2}, ..., S_{mmax X mmax}\).
For the 'ADP-EQP' variant, only the sums over symmetric tables (same number of cells on both axes) are considered:
\(S_{2X2}, S_{3X3}, S_{4X4}, ..., S_{mmax X mmax}\).
Other variants are described in hhg.univariate.ind.combined.test. The above variants are the ones to be used for a large number of observations (n>100).
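To make the difference between the two variants concrete, the following minimal sketch lists which table dimensions each one covers. It only builds labels and computes no statistic. The variable names are illustrative, and starting the range at mmin is an assumption (the text above lists the tables from 2, the default mmin).

```r
# Enumerate the table dimensions covered by each variant (labels only).
mmin <- 2
mmax <- 4

# 'ADP-EQP-ML': every m x l table with m and l between mmin and mmax
ml_tables <- expand.grid(m = mmin:mmax, l = mmin:mmax)
ml_labels <- paste0(ml_tables$m, "X", ml_tables$l)

# 'ADP-EQP': symmetric m x m tables only
eqp_labels <- paste0(mmin:mmax, "X", mmin:mmax)

length(ml_labels)   # (mmax - mmin + 1)^2 table sizes
length(eqp_labels)  # mmax - mmin + 1 table sizes
```

The symmetric tables are a subset of the m x l tables, so 'ADP-EQP-ML' always considers at least as many statistics as 'ADP-EQP'.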
Test functions are capable of handling large datasets by attempting a split only every \(N/nr.atoms\) observations. An atom is a sequence of observations which cannot be split when performing a partition of the data (i.e., setting nr.atoms, the number of sequences which cannot be split, sets the number of equidistant partition points). For the above variants, 'EQP' stands for equipartition over atoms. Brill (2016) suggests a minimum of 40 atoms, with an increase of up to 60 for alternatives which are more difficult to detect, at the expense of computational cost: algorithm complexity is O(nr.atoms^4). Very few alternatives require over 80 atoms.
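As a rough illustration of the atom idea, the sketch below groups n ranked observations into nr.atoms equal-sized atoms and lists the ranks at which a split may be attempted. It is not the package's internal code. The variable names are hypothetical, and it assumes n is a multiple of the number of atoms so that all atoms have the same size.

```r
# Sketch: atoms over ranked observations (assumes n is a multiple of nr_atoms).
n <- 1000
nr_atoms <- 40

# Number of consecutive ranked observations that form one atom
atom_size <- n / nr_atoms

# Atom membership of each ranked observation
atom_id <- ceiling(seq_len(n) / atom_size)

# Ranks after which a split may be attempted: boundaries between atoms
split_points <- seq_len(nr_atoms - 1) * atom_size

# Relative cost of using 60 or 80 atoms instead of 40, under O(nr.atoms^4)
rel_cost <- (c(60, 80) / 40)^4
```

Under the stated complexity, moving from 40 to 60 atoms multiplies the cost by about 5, and moving to 80 atoms multiplies it by 16, which is why the number of atoms is raised only when the alternative is hard to detect.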
The vector of \(S_{mXl}\) statistics is then combined according to the method suggested in Heller et al. (2014). The default combining type is the minimum p-value, so the test statistic is the minimum p-value over the range of partition sizes m from mmin to mmax, where the p-value for a fixed partition size m is defined by the aggregation type and score type. The combination is done over the statistics computed by hhg.univariate.ind.stat. The second combination method is a Fisher-type statistic, \(-\sum_{m} \log(p_m)\) (with the sum going from \(mmin\) to \(mmax\)). The returned result may include the test statistic for the MinP combination, the Fisher combination, or both (see combining.type).
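The two combination rules can be sketched in a few lines of base R. The per-m p-values below are made-up numbers used only for illustration, and the variable names are hypothetical. In the actual test these p-values come from comparing each statistic to its null distribution.

```r
# Hypothetical p-values for partition sizes m = 2, ..., 6 (illustrative numbers)
mmin <- 2
mmax <- 6
p_m <- c(0.20, 0.04, 0.01, 0.03, 0.15)
names(p_m) <- mmin:mmax

# MinP combination: the smallest single-m p-value, and the m that attains it
minp_stat <- min(p_m)
minp_m <- as.integer(names(p_m)[which.min(p_m)])

# Fisher-type combination: minus the sum of the log p-values
fisher_stat <- -sum(log(p_m))
```

Neither combined value is itself a p-value. Each has to be compared with the null distribution of the same combined statistic, which is what the null table provides.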
If the argument NullTable is supplied with a proper null table (constructed using Fast.independence.test.nulltable or hhg.univariate.ind.nulltable, for the data sample size), test parameters are taken from NullTable (mmax, mmin, variant, score.type, nr.atoms, ...). If NullTable is left NULL, a null table is generated by a call to Fast.independence.test.nulltable using the arguments supplied to this function. The null table is generated with nr.perm repetitions and is stored in the returned object, under generated_null_table. When testing multiple hypotheses, one may generate only one null table (using this function or Fast.independence.test.nulltable) and use it many times, substantially reducing computation time. Generated null tables hold the distribution of statistics for both combination types (combining.type='MinP' and combining.type='Fisher').
Null tables may be compressed, using the compress argument. For each of the partition sizes (i.e., m or mXm), the null distribution is held at a compress.p0 resolution up to the compress.p percentile. Beyond that value, the distribution is held at a finer resolution defined by compress.p1 (since higher values are attained when a relation exists in the data, this is required for computing the p-value accurately).
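The two-resolution idea can be illustrated with base R quantiles. This is a sketch under stated assumptions, not the package's storage format: the null statistics are stand-in chi-square draws, all variable names are hypothetical, and the compressed table is represented simply as a grid of quantiles.

```r
set.seed(1)
# Stand-in for simulated null statistics (an assumption for illustration)
null_stats <- rchisq(1e6, df = 4)

compress_p  <- 0.99      # part of the distribution kept at the coarse resolution
compress_p0 <- 0.001     # coarse resolution, used below compress_p
compress_p1 <- 0.000001  # fine resolution, used above compress_p

# Probability grids: coarse steps up to compress_p, fine steps beyond it
probs_lower <- seq(0, compress_p,
                   length.out = round(compress_p / compress_p0) + 1)
probs_upper <- seq(compress_p, 1,
                   length.out = round((1 - compress_p) / compress_p1) + 1)

# Compressed representation: quantiles on the combined grid
# (the first upper point duplicates the last lower point, so it is dropped)
grid_probs <- c(probs_lower, probs_upper[-1])
grid_vals  <- quantile(null_stats, grid_probs, names = FALSE)

# Tail probability of an observed statistic, from the full and compressed tables
obs <- qchisq(0.995, df = 4)
p_full <- mean(null_stats >= obs)
p_compressed <- 1 - approx(grid_vals, grid_probs, xout = obs, rule = 2)$y
```

Here the compressed grid holds roughly eleven thousand values instead of one million, while the fine upper grid keeps the small p-values in the right tail close to those computed from the full table.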
Heller, R., Heller, Y., Kaufman, S., Brill, B., & Gorfine, M. (2016). Consistent Distribution-Free K-Sample and Independence Tests for Univariate Random Variables, JMLR 17(29):1-54 https://www.jmlr.org/papers/volume17/14-441/14-441.pdf
Brill B., Heller Y., and Heller R. (2018) Nonparametric Independence Tests and k-sample Tests for Large Sample Sizes Using Package HHG, R Journal 10.1 https://journal.r-project.org/archive/2018/RJ-2018-008/RJ-2018-008.pdf
Brill B. (2016) Scalable Non-Parametric Tests of Independence (master's thesis) https://tau.userservices.exlibrisgroup.com/discovery/delivery/972TAU_INST:TAU/12397000130004146?lang=he&viewerServiceCode=AlmaViewer
if (FALSE) {
  N_Large = 1000
  data_Large = hhg.example.datagen(N_Large, 'W')
  X_Large = data_Large[1,]
  Y_Large = data_Large[2,]
  plot(X_Large, Y_Large)

  NullTable_for_N_Large_MXL_tables = Fast.independence.test.nulltable(N_Large,
    variant = 'ADP-EQP-ML', nr.atoms = 30, nr.perm = 200)

  ADP_EQP_ML_Result = Fast.independence.test(X_Large, Y_Large,
    NullTable_for_N_Large_MXL_tables)
  ADP_EQP_ML_Result

  # The null distribution depends only on the sample size, so the same
  # null table can be used for testing different hypotheses with the same sample size.
  # For example, for another data set with N_Large sample size:
  data_Large = hhg.example.datagen(N_Large, 'Circle')
  X_Large = data_Large[1,]
  Y_Large = data_Large[2,]
  plot(X_Large, Y_Large)

  # The MinP combining method p-values may be reported:
  ADP_EQP_ML_Result = Fast.independence.test(X_Large, Y_Large,
    NullTable_for_N_Large_MXL_tables,
    combining.type = 'MinP')
  ADP_EQP_ML_Result

  # or both MinP and Fisher combining methods p-values:
  ADP_EQP_ML_Result = Fast.independence.test(X_Large, Y_Large,
    NullTable_for_N_Large_MXL_tables,
    combining.type = 'Both')
  ADP_EQP_ML_Result
}