feemsplithalf: Split-half analysis of PARAFAC models

Description

This function validates PARAFAC with different numbers of components by means of splitting the data cube in halves, fitting PARAFAC to them and comparing the results albatross:::.Rdcite('DeSarbo1984').

Usage

feemsplithalf(
    cube, nfac, splits, random, groups, fixed, ..., progress = TRUE
  )
  # S3 method for feemsplithalf
plot(
    x, kind = c('tcc', 'factors', 'aggtcc', 'bandfactors'), ...
  )
  # S3 method for feemsplithalf
print(x, ...)
  # S3 method for feemsplithalf
coef(
    object, kind = c('tcc', 'factors', 'aggtcc', 'bandfactors'), ...
  )

Value

feemsplithalf, print.feemsplithalf

An object of class feemsplithalf, containing named fields:

factors: A list of feemparafac objects containing the factors of the halves. The list has dimensions, the first one corresponding to the halves (always 2), the second to different numbers of factors (as many as in nfac) and the third to different groupings of the samples (depends on splits or random).
tcc: A named list containing arrays of Tucker's congruence coefficients between the halves. Each entry in the list corresponds to an element in the nfac argument. The dimensions of each array in the list correspond to, in order: the factors (1 to nfac[i]), the modes (emission or excitation) and the groupings of the samples (depending on splits or random).
nfac: A copy of nfac argument.

plot.feemsplithalf

A lattice plot object. Its print or plot method will draw the plot on an appropriate plotting device.

coef.feemsplithalf

A data.frame containing various columns, depending on the value of the kind argument:

tcc

factor: The factor (out of nfac) under consideration.

tcc

Tucker's congruence coefficient between a pair of matching components. Out of two possible values (TCC between excitation loadings or emission loadings), the minimal one is chosen, because the same rule is used to find which components match when reordering them in a pair of models.

test

The sequence number for each pair of models in the split-half test, related to the third dimension of object$factors or object$tcc. May be used to group values for plotting or aggregation.

subset

Consists of two-element lists containing indices of the samples in each half of the original cube.

nfac

The number of factors in the pair of models under consideration.

factors

wavelength: Emission and excitation wavelengths.

value

The values of the loadings.

factor

Number of the factor, $1$ to nfac.

mode

The mode the loading value belongs to, “Emission” or “Excitation”.

nfac

Total number of factors.

test

Sequence number of a split-half test, indicating a given way to split the dataset in a group of splits with the same numbers of factors.

half

Number of the half, $1$ or $2$.

subset

For every row, this is an integer vector indicating the subset of the original data cube that the loadings have been obtained from.

aggtcc

The columns tcc, nfac, test after aggregation of coef(kind = 'tcc').

bandfactors

Columns wavelength, factor, mode, nfac from coef(kind = 'factors'), plus columns lower, estimate, upper signifying the outputs from the aggregation function.

Arguments

cube

A feemcube object.

nfac

An integer vector of numbers of factors to check.

splits

A scalar or a two-element vector consisting of whole numbers.

The first element is the number of parts to split the data cube into, which must be even. After splitting, the parts are recombined into non-intersecting halves albatross:::.Rdcite('Murphy2013'), which are subjected to PARAFAC decomposition and compared against each other.

The second element, if specified, limits the total number of comparisons between the pairs, since the number of potential ways to recombine the parts of the data cube into halves grows very quickly.

The number of PARAFAC models fitted is $ 2 \cdot \mathtt{splits[2]} $. If only splits[1] is specified, splits[2] defaults to $ \mathtt{splits[1]} \choose {\mathtt{splits[1]}/2} $.

Mutually incompatible with the parameters random, fixed.

random

Number of times to shuffle the dataset, split into non-intersecting halves, fit a PARAFAC model to each of the halves and compare halves against each other albatross:::.Rdcite('Krylov2020').

The number of PARAFAC models fitted is $2 \cdot \mathtt{random}$.

Mutually incompatible with the parameters splits, fixed.

groups

Use this argument to preserve the ratios between the groups present in the original dataset when constructing the halves. If specified, must be a factor or an integer vector of length dim(cube)[3] (specifying the group each sample belongs to) or a list of them, i.e., a valid f argument to split. By default, samples are considered to form a single group.

For the split-combine method (splits), each group must have at least splits elements; for best results, sizes of groups should be close to a multiple of splits. For the randomised split-half method (random), each group should have at least $2$ elements.

Mutually incompatible with the fixed parameter.

fixed

Use this argument to manually specify the contents of the halves to test. The argument must be a list containing two-element lists specifying the halves to compare. Each half must be a vector consisting of whole numbers specifying sample indices in the cube (see the example).

It is considered an error to specify a sample in both halves.

Mutually incompatible with the parameters splits, random, groups.

progress

Set to FALSE to disable the progress bar.

x, object

An object returned by feemsplithalf.

kind

Chooses what type of data to return or plot:

tcc

Between-half TCCs for different numbers of components. The smallest TCC is chosen between emission- and excitation-mode values, but otherwise they are not aggregated.

When plotting, TCC values for the component with the same number have the same colour.

factors

The resulting loading values.

When plotting, split the plot into panels per each number of components and each mode (emission or excitation). Components with the same number have the same colour.

aggtcc

aggregate the TCCs returned by coef(x, 'tcc') over individual components. By default, the function returns the minimal values, but a different aggregation can be chosen using the additional argument FUN.

When plotting, use a combination of a box-and-whiskers plot and a violin plot.

bandfactors

aggregate the factor values from coef(x, 'factors') over individual tests and halves. By default, collect the $2.5\%$ and $97.5\%$ quantiles as the lower and the upper boundaries, respectively, and medians as the estimates.

The additional argument FUN can be specified as a function returning c(lower, estimate, upper) given a numeric vector. The additional argument subset works as in base::subset.

When plotting, the estimates are lines on top of semi-transparent polygons signifying the lower and upper boundaries..

...

feemsplithalf

Remaining options are passed to feemparafac and, eventually, to parafac. It is recommended to fully name the parameters instead of relying on partial or positional matching.

The arguments parallel, cl are handled separately, see Details.

plot.feemsplithalf

Passed to xyplot.

print.feemsplithalf

No additional options are allowed.

coef.feemsplithalf

Ignored unless kind %in% c('aggtcc', 'bandfactors').

Details

As the models (loadings $\mathbf A$, $\mathbf B$ and scores $\mathbf C$) are fitted, they are compared to the first model of the same number of factors (Tucker's congruence coefficient is calculated using congru for emission and excitation mode factors, then the smallest value of the two is chosen for the purposes of matching). The models are first reordered according to the best match by TCC value, then rescaled albatross:::.Rdcite('Riu2003') by minimising $ || \mathbf A \, \mathrm{diag}(\mathbf s_\mathrm A) - \mathbf A^\mathrm{orig} ||^2 $ and $ || \mathbf{B} \, \mathrm{diag}(\mathbf s_\mathrm B) - \mathbf B^\mathrm{orig} ||^2 $ over $\mathbf s_\mathrm A$ and $\mathbf s_\mathrm B$, subject to $ \mathrm{diag}(\mathbf s_\mathrm A) \times \mathrm{diag}(\mathbf s_\mathrm B) \times \mathrm{diag}(\mathbf s_\mathrm C) = \mathbf I $, to make them comparable.

To perform stratified sampling on a real-valued variable (e.g. salinity, depth), consider binning samples into groups using cut, perhaps after histogram flattening using ecdf(x)(x). To determine the number of breaks, consider nclass.Sturges.

To conserve memory, feemsplithalf puts the user-provided cube in an environment and passes it via envir and subset options of feemparafac. This means that, unlike in feemparafac, the cube argument has to be a feemcube object and passing envir and subset options to feemsplithalf is not supported.

Instead of forwarding the arguments parallel, cl to multiway::parafac, feemsplithalf schedules the calls to feemparafac on the cluster by itself. This makes it possible to fit more than nstart models at the same time if enough nodes are present in the parallel cluster cl.

plot.feemsplithalf plots results of the split-half procedure (TCC or loading values depending on the kind argument) using lattice graphics. Sane defaults are provided for xyplot parameters xlab, ylab, as.table, but they can be overridden.

print.feemsplithalf displays a very short summary of the analysis, currently the minimum TCC value for each number of components.

coef.feemsplithalf returns the Tucker's congruence coefficients resulting from the split-half analysis.

References

albatross:::.Rdbibliography()

Examples

Run this code

# \donttest{
  data(feems)
  cube <- feemscale(feemscatter(cube, rep(14, 4)), na.rm = TRUE)

  (sh <- feemsplithalf(
    cube, 1:4, splits = 4, # => S4C6T3
    # splits = c(4, 2) would be S4C4T2, and so on
    # the rest is passed to multiway::parafac;
    ctol = 1e-4
    # here we set a mild stopping criterion for speed;
    # be sure to use a stricter one for real tasks
  ))

  # specifying fixed halves to compare as list of 2-element lists
  fixed <- list(
    list(1:6, 7:12),
    list(seq(1, 11, 2), seq(2, 12, 2))
  )
  sh.f <- feemsplithalf(cube, 2:3, fixed = fixed, ctol = 1e-4)

  plot(sh, 'aggtcc')
  head(coef(sh, 'factors'))
# }

Run the code above in your browser using DataLab

Description

Usage

Value

Arguments

Details

References

See Also

Examples