HHG (version 2.3.7)

hhg.univariate.ks.combined.test: Distribution-free K-sample tests

Description

Performs distribution-free tests for equality of a univariate distribution across K groups.

Usage

hhg.univariate.ks.combined.test(X,Y=NULL,NullTable=NULL,mmin=2,
mmax=ifelse(is.null(Y),4,max(4,round(min(table(Y))/3))), aggregation.type='sum',
score.type='LikelihoodRatio' ,combining.type='MinP',nr.perm=1000,
variant='KSample-Variant', nr.atoms = nr_bins_equipartition(length(X)),
compress=F,compress.p0=0.001,compress.p=0.99,compress.p1=0.000001,keep.simulation.data=T)

Value

Returns a UnivariateStatistic class object, with the following entries:

MinP

The test statistic when the combining type is "MinP".

MinP.pvalue

The p-value when the combining type is "MinP".

MinP.m.chosen

The partition size m for which the p-value was the smallest.

Fisher

The test statistic when the combining type is "Fisher".

Fisher.pvalue

The p-value when the combining type is "Fisher".

m.stats

The statistic for each m in the range mmin to mmax.

pvalues.of.single.m

The p-values for each m in the range mmin to mmax.

generated_null_table

The generated null table object. NULL if a NullTable was supplied.

stat.type

"KSample-Combined"

aggregation.type

a character string specifying the aggregation type used in the test, one of "sum" or "max".

score.type

a character string specifying the score type used in the test, one of "LikelihoodRatio" or "Pearson".

mmax

The maximum partition size of the ranked observations used for MinP or Fisher test statistic.

mmin

The minimum partition size of the ranked observations used for MinP or Fisher test statistic.

nr.atoms

The input nr.atoms.

Arguments

X

A numeric vector of data values (tied observations are broken at random), or the test statistic as output from hhg.univariate.ks.stat.

Y

For k groups, a vector of integers with values 0:(k-1) specifying the group each observation belongs to. Leave as NULL if the input to X is the test statistic.

NullTable

The null table of the statistic, which can be downloaded from the software website or computed by the function hhg.univariate.ks.nulltable.

mmin

The minimum partition size of the ranked observations, default value is 2. Ignored if NullTable is non-null.

mmax

The maximum partition size of the ranked observations. The default is the larger of 4 and one third (rounded) of the number of observations in the smallest group, or 4 if Y is NULL. Ignored if NullTable is non-null.

aggregation.type

a character string specifying the aggregation type, must be one of "sum" (default), or "max". Ignored if NullTable is non-null or X is the test statistic.

score.type

a character string specifying the score type, must be one of "LikelihoodRatio" (default), or "Pearson". Ignored if NullTable is non-null or X is the test statistic.

combining.type

a character string specifying the combining type, must be one of "MinP" (default), "Fisher", or "both".

nr.perm

The number of permutations for the null distribution. Ignored if NullTable is non-null.

variant

Default value is 'KSample-Variant'. Setting the variant to 'KSample-Equipartition' performs the K-sample tests over partitions of the data where splits between cells are at least \(n/nr.atoms\) apart.

nr.atoms

If variant is 'KSample-Equipartition', this is the number of atoms (i.e., possible split points in the data). The default value is the minimum of \(n\) and \(60 + 0.5\sqrt{n}\).

compress

a logical variable indicating whether you want to compress the null tables. If TRUE, the lower compress.p part of the null statistics is kept at a compress.p0 resolution, while the upper part is kept at a compress.p1 resolution (which is finer).

compress.p0

Parameter for compression. This is the resolution for the lower compress.p part of the null distribution.

compress.p

Parameter for compression. Part of the null distribution to compress.

compress.p1

Parameter for compression. This is the resolution for the upper value of the null distribution.

keep.simulation.data

a logical variable indicating whether, in addition to the sorted statistics per column, the original matrix of size nr.replicates by mmax-mmin+1 is also stored. Ignored if NullTable is non-null.
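
As an illustration of the mmax default shown in Usage, the following base R sketch evaluates the same expression for a few group layouts. The helper name default_mmax is hypothetical and is not part of the package.

```r
# Hypothetical helper reproducing the default mmax expression from Usage.
default_mmax <- function(Y) {
  if (is.null(Y)) 4 else max(4, round(min(table(Y)) / 3))
}

Y_balanced <- c(rep(0, 30), rep(1, 45))  # smallest group has 30 observations
Y_small    <- c(rep(0, 6),  rep(1, 9))   # smallest group has 6 observations

default_mmax(Y_balanced)  # 10: one third of 30
default_mmax(Y_small)     # 4: one third of 6 is 2, so the floor of 4 applies
default_mmax(NULL)        # 4: X was supplied as a precomputed statistic
```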

Author

Barak Brill and Shachar Kaufman.

Details

The function outputs test statistics and p-values of the combined omnibus distribution-free test of equality of distributions among K groups, as described in Heller et al. (2016). The test combines statistics from a range of partition sizes. The default combining type is the minimum p-value, so the test statistic is the minimum p-value over the range of partition sizes m from mmin to mmax, where the p-value for a fixed partition size m is defined by the aggregation type and score type. The second combining type is a Fisher-type statistic, \(-\sum_{m=mmin}^{mmax} \log(p_m)\), where \(p_m\) is the p-value for partition size m. The returned result may include the test statistic for the MinP combination, the Fisher combination, or both (see combining.type).
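
The two combining statistics can be traced by hand with base R. The sketch below uses made-up per-m p-values (they are not output of the package) and computes only the combined statistics. The p-value of each combined statistic still has to be read from a null table, which is not shown here.

```r
# Hypothetical per-m p-values for m = 2, 3, 4 (illustrative numbers only).
p_m <- c("2" = 0.20, "3" = 0.03, "4" = 0.08)

minp_stat   <- min(p_m)                           # MinP statistic: 0.03
m_chosen    <- as.integer(names(which.min(p_m)))  # partition size attaining it: 3
fisher_stat <- -sum(log(p_m))                     # Fisher-type statistic

minp_stat
m_chosen
fisher_stat
```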

If the argument NullTable is supplied with a proper null table (constructed using hhg.univariate.ks.nulltable for the sample sizes of the K groups), then the test parameters mmax, mmin, variant, aggregation.type, score.type, nr.atoms, and so on are taken from NullTable.

If NullTable is left NULL, a null table is generated by a call to hhg.univariate.ks.nulltable using the arguments supplied to this function. The null table is generated with nr.perm repetitions. It is stored in the returned object as generated_null_table. When testing multiple hypotheses with the same group sample sizes, it is computationally efficient to generate only one null table (using this function or hhg.univariate.ks.nulltable) and use it for all hypotheses tested. Generated null tables hold the distribution of statistics for both combining types (combining.type 'MinP' and 'Fisher').
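
A minimal sketch of reusing one null table across several hypotheses with the same group sizes follows. The data sets and variable names are made up for illustration, and the code only calls the package if HHG is installed.

```r
set.seed(1)
n0 <- 30
n1 <- 30
y  <- c(rep(0, n0), rep(1, n1))

# Two illustrative data sets sharing the same group sizes.
datasets <- list(
  shifted = c(rnorm(n0, 0, 1), rnorm(n1, 1, 1)),
  same    = rnorm(n0 + n1)
)

pvals <- NULL
if (requireNamespace("HHG", quietly = TRUE)) {
  # Build the null table once for group sizes (n0, n1) ...
  null_table <- HHG::hhg.univariate.ks.nulltable(c(n0, n1), nr.replicates = 200)
  # ... and reuse it for every data set instead of regenerating it per test.
  pvals <- sapply(datasets, function(x) {
    HHG::hhg.univariate.ks.combined.test(x, y, NullTable = null_table)$MinP.pvalue
  })
}
pvals
```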

If X is supplied with a statistic (UnivariateStatistic object, returned by hhg.univariate.ks.stat), X must have the statistics (by m), required by either NullTable or the user supplied arguments mmin and mmax. If X has a larger mmax argument than the supplied null table object, the statistics which exceed the null table's mmax are not taken into consideration when computing the combined statistic.

Variant type "KSample-Equipartition" is the atom based version of the K-sample test. Calculation time is reduced by aggregating over a subset of partitions, where a split between cells may be performed only every \(n/nr.atoms\) observations. Atom based tests are available when aggregation.type is set to 'sum' or 'max'.
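
To give a feel for the atom spacing, the base R sketch below evaluates the documented nr.atoms default, the minimum of \(n\) and \(60 + 0.5\sqrt{n}\), for a large sample. The helper default_nr_atoms is hypothetical, and the packaged nr_bins_equipartition may round its result differently.

```r
# Hypothetical helper mirroring the documented default for nr.atoms.
default_nr_atoms <- function(n) min(n, 60 + 0.5 * sqrt(n))

n       <- 10^4
atoms   <- default_nr_atoms(n)  # 110, since 60 + 0.5 * 100 = 110 < 10^4
spacing <- n / atoms            # roughly 91 observations between admissible splits

atoms
spacing
default_nr_atoms(50)            # 50: for small n every observation is an atom
```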

Null tables may be compressed, using the compress argument. For each of the partition sizes, the null distribution is held at a compress.p0 resolution up to the compress.p percentile. Beyond that value, the distribution is held at a finer resolution defined by compress.p1 (since higher values are attained when a relation exists in the data, this is required for computing the p-value accurately in the tail of the null distribution.)
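
The idea of a two-resolution null distribution can be sketched in base R. This is only an illustration of the description above under assumed details. It is not the package's internal implementation, and the simulated statistics are a stand-in.

```r
set.seed(1)
null_stats <- rchisq(1e5, df = 3)  # stand-in for simulated null statistics

compress.p  <- 0.99      # part of the distribution kept at coarse resolution
compress.p0 <- 0.001     # coarse step below compress.p
compress.p1 <- 0.000001  # fine step above compress.p

# Coarse grid of probabilities up to compress.p, fine grid beyond it.
lower <- seq(0, compress.p, by = compress.p0)
upper <- seq(compress.p, 1, by = compress.p1)[-1]
grid  <- c(lower, upper)

# Keep only the quantiles on the grid instead of every simulated value.
compressed <- quantile(null_stats, probs = grid, names = FALSE)

length(null_stats)
length(compressed)
```

The fine grid in the upper tail is what keeps small p-values accurate after compression.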

References

Heller, R., Heller, Y., Kaufman, S., Brill, B., & Gorfine, M. (2016). Consistent Distribution-Free K-Sample and Independence Tests for Univariate Random Variables. JMLR 17(29):1-54. https://www.jmlr.org/papers/volume17/14-441/14-441.pdf

Brill B. (2016) Scalable Non-Parametric Tests of Independence (master's thesis) https://tau.userservices.exlibrisgroup.com/discovery/delivery/972TAU_INST:TAU/12397000130004146?lang=he&viewerServiceCode=AlmaViewer

Examples

if (FALSE) {
#Two groups, each from a different normal mixture:
N0=30
N1=30
X = c(c(rnorm(N0/2,-2,0.7),rnorm(N0/2,2,0.7)),c(rnorm(N1/2,-1.5,0.5),rnorm(N1/2,1.5,0.5)))
Y = (c(rep(0,N0),rep(1,N1)))
plot(Y,X)

#I) Perform MinP & Fisher Tests - without existing null tables.
#Null tables are generated by the test function.

results = hhg.univariate.ks.combined.test(X,Y,nr.perm = 100)
results


#The null table can then be accessed.
generated.null.table = results$generated_null_table


#II)Perform MinP & Fisher Tests - with existing null tables. 

#null table for aggregation by summation: 
sum.nulltable = hhg.univariate.ks.nulltable(c(N0,N1), nr.replicates=1000) 

MinP.Sm.existing.null.table = hhg.univariate.ks.combined.test(X,Y,
NullTable = sum.nulltable)

#Results
MinP.Sm.existing.null.table

# combined test can also be performed by using the test statistic.
Sm.statistic = hhg.univariate.ks.stat(X,Y)
MinP.using.statistic = hhg.univariate.ks.combined.test(Sm.statistic,
NullTable = sum.nulltable)
# same result as above
MinP.using.statistic$MinP.pvalue

#null table for aggregation by maximization: 
max.nulltable = hhg.univariate.ks.nulltable(c(N0,N1), aggregation.type = 'max', 
  score.type='LikelihoodRatio', mmin = 2, mmax = 10, nr.replicates = 100)

#combined test using both "MinP" and "Fisher":
MinPFisher.Mm.result = hhg.univariate.ks.combined.test(X,Y,NullTable =  max.nulltable ,
  combining.type = 'Both')
MinPFisher.Mm.result


#III) Perform MinP & Fisher Tests for extremely large n

#Two groups, each from a different normal mixture, total sample size is 10^4:
X_Large = c(c(rnorm(2500,-2,0.7),rnorm(2500,2,0.7)),
c(rnorm(2500,-1.5,0.5),rnorm(2500,1.5,0.5)))
Y_Large = (c(rep(0,5000),rep(1,5000)))
plot(Y_Large,X_Large)


N0_large = 5000
N1_large = 5000

Sm.EQP.null.table = hhg.univariate.ks.nulltable(c(N0_large,N1_large), nr.replicates=200,
variant = 'KSample-Equipartition', mmax = 30)
Mm.EQP.null.table = hhg.univariate.ks.nulltable(c(N0_large,N1_large), nr.replicates=200,
aggregation.type='max', variant = 'KSample-Equipartition', mmax = 30)

MinPFisher.Sm.EQP.result = hhg.univariate.ks.combined.test(X_Large, Y_Large,
NullTable =  Sm.EQP.null.table ,
  combining.type = 'Both')
MinPFisher.Sm.EQP.result

MinPFisher.Mm.EQP.result = hhg.univariate.ks.combined.test(X_Large, Y_Large,
NullTable =  Mm.EQP.null.table ,
  combining.type = 'Both')
MinPFisher.Mm.EQP.result



}
