HHG (version 2.3.7)

hhg.univariate.ks.stat: The K-sample test statistics for all partition sizes

Description

These statistics are used in the omnibus distribution-free test of equality of distributions among K groups, as described in Heller et al. (2016).

Usage

hhg.univariate.ks.stat(x, y, variant = 'KSample-Variant', aggregation.type = 'sum',
  score.type = 'LikelihoodRatio', mmax = max(4, round(min(table(y)) / 3)), mmin = 2,
  nr.atoms = nr_bins_equipartition(length(x)))

Value

Returns a UnivariateStatistic class object, with the following entries:

statistic

The value of the computed statistic if the score type is one of "LikelihoodRatio" or "Pearson", and the aggregation type is one of "sum" or "max". In that case it is the corresponding one of sum.chisq, sum.lr, max.chisq, and max.lr.

sum.chisq

A vector of size \(mmax-mmin+1\), where entry \(m-mmin+1\) is the average over all Pearson chi-squared statistics from all the \(K \times m\) contingency tables considered, divided by the total number of observations.

sum.lr

A vector of size \(mmax-mmin+1\), where entry \(m-mmin+1\) is the average over all likelihood ratio statistics from all the \(K \times m\) contingency tables considered, divided by the total number of observations.

max.chisq

A vector of size \(mmax-mmin+1\), where entry \(m-mmin+1\) is the maximum over all Pearson chi-squared statistics from all the \(K \times m\) contingency tables considered.

max.lr

A vector of size \(mmax-mmin+1\), where entry \(m-mmin+1\) is the maximum over all likelihood ratio statistics from all the \(K \times m\) contingency tables considered.

type

"KSample".

stat.type

"KSample".

size

A vector of size K of the ordered group sample sizes.

score.type

The input score.type.

aggregation.type

The input aggregation.type.

mmin

The input mmin.

mmax

The input mmax.

nr.atoms

The input nr.atoms.

Arguments

x

a numeric vector of data values. Tied observations are broken at random.

y

for K groups, a vector of integers with values 0:(K-1) specifying the group each observation belongs to.
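
Group labels stored in another form (character or factor labels, for instance) need to be recoded before they are passed as y. A minimal base-R sketch of one way to do that; the vector `groups` is hypothetical illustration data, not part of the package:

```r
# Hypothetical group labels; any vector that factor() accepts would do.
groups <- c("control", "treated", "control", "placebo", "treated")

# factor() assigns integer codes 1..K (levels are sorted by default);
# subtracting 1 shifts the codes to 0..(K-1).
y <- as.integer(factor(groups)) - 1L

y          # 0 2 0 1 2
table(y)   # group sizes per recoded label
```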

variant

Default value is 'KSample-Variant'. Setting the variant to 'KSample-Equipartition' performs the K-sample tests over partitions of the data where splits between cells are at least \(n/nr.atoms\) apart.

aggregation.type

a character string specifying the aggregation type, must be one of "sum" (default), "max", or "both".

score.type

a character string specifying the score type, must be one of "LikelihoodRatio" (default), "Pearson", or "both".

mmax

The maximum partition size of the ranked observations. The default value is one third of the number of observations in the smallest group (rounded), but never less than 4, i.e. max(4, round(min(table(y))/3)).
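
The default expression from the Usage section, max(4,round(min(table(y))/3)), can be evaluated by hand to see which mmax a given grouping implies. A short sketch using the group sizes of the two-sample example; `y_small` is made-up data for illustration:

```r
# Group labels as in the two-sample example: 50 observations per group.
y <- c(rep(0, 50), rep(1, 50))

smallest_group <- min(table(y))                    # 50
default_mmax <- max(4, round(smallest_group / 3))
default_mmax                                       # 17

# With a very small group, the lower bound of 4 takes over.
y_small <- c(rep(0, 6), rep(1, 40))
default_mmax_small <- max(4, round(min(table(y_small)) / 3))
default_mmax_small                                 # 4
```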

mmin

The minimum partition size of the ranked observations, default value is 2.

nr.atoms

For variant=='KSample-Equipartition' type tests, sets the number of possible split points in the data. The default value is the minimum of \(n\) and \(60+0.5\sqrt{n}\).
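
To get a feel for the default, the documented formula can be evaluated for a few sample sizes. The helper `default_nr_atoms` below is hypothetical and follows the formula literally; the package's own nr_bins_equipartition() may round the result, which is not stated here:

```r
# Hypothetical helper: the documented default, min(n, 60 + 0.5 * sqrt(n)).
# Any rounding done by the package's nr_bins_equipartition() is not reproduced.
default_nr_atoms <- function(n) min(n, 60 + 0.5 * sqrt(n))

default_nr_atoms(30)     # 30: for small samples every observation is an atom
default_nr_atoms(100)    # 65
default_nr_atoms(10^4)   # 110
```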

Author

Barak Brill and Shachar Kaufman.

Details

For each partition size \(m = mmin,\ldots,mmax\), the function computes the scores in each of the partitions (according to the score type), and aggregates all scores according to the aggregation type (see details in Heller et al., 2016). If the score type is one of "LikelihoodRatio" or "Pearson", and the aggregation type is one of "sum" or "max", then the computed statistic will be in statistic; otherwise the computed statistics will be in the appropriate subset of sum.chisq, sum.lr, max.chisq, and max.lr.

For the 'sum' aggregation type (default), the test statistic is the sum of the log-likelihood ratio (or Pearson chi-squared) scores over all partitions of the data of size \(m\), normalized by the number of partitions and the data size (which makes it an estimator of the mutual information). For the 'max' aggregation type, the test statistic is the maximum log-likelihood ratio (or Pearson chi-squared) score achieved by any partition of the data of size \(m\).
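
The two aggregation types can be illustrated with a brute-force base-R sketch for \(m = 2\) and two groups. This is not the package's algorithm, and its scaling conventions are assumptions: the likelihood ratio score is taken to be the usual \(2\sum O \log(O/E)\), and the helper `table_scores` and the simulated data are hypothetical.

```r
set.seed(1)
x <- c(rnorm(20, 0, 1), rnorm(20, 1, 1))
y <- rep(0:1, each = 20)
n <- length(x)

# Group labels arranged in the order of the ranked observations.
y_sorted <- y[order(x)]

# Likelihood ratio and Pearson scores of a K x m table of observed counts.
table_scores <- function(obs) {
  expected <- outer(rowSums(obs), colSums(obs)) / sum(obs)
  nonzero <- obs > 0   # empty cells contribute 0 to the likelihood ratio score
  c(lr = 2 * sum(obs[nonzero] * log(obs[nonzero] / expected[nonzero])),
    chisq = sum((obs - expected)^2 / expected))
}

# For m = 2, each split between consecutive ranks defines one partition.
scores <- sapply(1:(n - 1), function(split) {
  cell <- factor(ifelse(seq_len(n) <= split, 1, 2), levels = 1:2)
  obs <- unclass(table(factor(y_sorted, levels = 0:1), cell))
  table_scores(obs)
})

# 'sum' aggregation: average over all partitions, divided by the sample size.
sum_lr <- mean(scores["lr", ]) / n
sum_chisq <- mean(scores["chisq", ]) / n

# 'max' aggregation: the best score achieved by any single partition.
max_lr <- max(scores["lr", ])
max_chisq <- max(scores["chisq", ])

c(sum_lr = sum_lr, sum_chisq = sum_chisq, max_lr = max_lr, max_chisq = max_chisq)
```

Enumerating every partition this way is only practical for very small \(m\), since the number of partitions grows quickly with \(m\).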

Variant type "KSample-Equipartition" is the computationally efficient version of the K-sample test. Calculation time is reduced by aggregating over a subset of partitions, in which a split between cells may be placed only every \(n/nr.atoms\) observations. This gives a complexity of O(nr.atoms^2) instead of O(n^2). Computationally efficient versions are available for both the aggregation.type=='sum' and aggregation.type=='max' variants.
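
The effect of restricting split points can be seen by counting candidate partitions. The sketch below assumes that a partition into \(m\) cells corresponds to a choice of \(m-1\) cut points between ordered observations (or between atoms), and uses nr_atoms = 110 as an assumed value taken from the documented default formula for \(n = 10^4\):

```r
n <- 10^4         # total sample size, as in the large-data example
nr_atoms <- 110   # assumed number of atoms (documented default formula for this n)
m <- 2:4          # a few partition sizes

# Ways to cut n ordered observations into m contiguous, non-empty cells.
all_partitions <- choose(n - 1, m - 1)

# The same count when cuts may fall only on the boundaries between atoms.
atom_partitions <- choose(nr_atoms - 1, m - 1)

data.frame(m, all_partitions, atom_partitions)
```

The counts only show how much smaller the set of aggregated partitions becomes; they are not a statement about the package's actual running time.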

References

Heller, R., Heller, Y., Kaufman, S., Brill, B., & Gorfine, M. (2016). Consistent Distribution-Free K-Sample and Independence Tests for Univariate Random Variables. JMLR 17(29):1-54.

Brill, B. (2016). Scalable Non-Parametric Tests of Independence (master's thesis). https://tau.userservices.exlibrisgroup.com/discovery/delivery/972TAU_INST:TAU/12397000130004146?lang=he&viewerServiceCode=AlmaViewer

Examples

#Example of computing the test statistics for data from a two-sample problem:

#Two groups, each from a different normal mixture:
X = c(c(rnorm(25,-2,0.7),rnorm(25,2,0.7)),c(rnorm(25,-1.5,0.5),rnorm(25,1.5,0.5)))
Y = (c(rep(0,50),rep(1,50)))
plot(Y,X)


#I) Computing test statistics, with default parameters:
hhg.univariate.Sm.Likelihood.result = hhg.univariate.ks.stat(X,Y)

hhg.univariate.Sm.Likelihood.result

#II) Computing test statistics, with max aggregation type:
hhg.univariate.Mm.likelihood.result = hhg.univariate.ks.stat(X,Y,aggregation.type = 'max')

hhg.univariate.Mm.likelihood.result


#III) Computing statistics which are computationally efficient for large data:

#Two groups, each from a different normal mixture, total sample size is 10^4:
X_Large = c(c(rnorm(2500,-2,0.7),rnorm(2500,2,0.7)),
c(rnorm(2500,-1.5,0.5),rnorm(2500,1.5,0.5)))
Y_Large = (c(rep(0,5000),rep(1,5000)))
plot(Y_Large,X_Large)

# For these variants, make sure to set mmax so that mmax <= nr.atoms

hhg.univariate.Sm.EQP.Likelihood.result = hhg.univariate.ks.stat(X_Large,Y_Large,
variant = 'KSample-Equipartition',mmax=30)

hhg.univariate.Sm.EQP.Likelihood.result

hhg.univariate.Mm.EQP.likelihood.result = hhg.univariate.ks.stat(X_Large,Y_Large,
aggregation.type = 'max',variant = 'KSample-Equipartition',mmax=30)

hhg.univariate.Mm.EQP.likelihood.result
