
TDApplied (version 3.0.3)

bootstrap_persistence_thresholds: Estimate persistence threshold(s) for topological features in a data set using bootstrapping.

Description

Bootstrapping is used to find a conservative estimate of a 100 x (1 - `alpha`) percent "confidence interval" around each point in the persistence diagram of the data set. Points whose intervals do not touch the diagonal (birth == death) are considered "significant" or "real". One threshold is computed for each dimension in the diagram.

Usage

bootstrap_persistence_thresholds(
  X,
  FUN_diag = "calculate_homology",
  FUN_boot = "calculate_homology",
  maxdim = 0,
  thresh,
  distance_mat = FALSE,
  ripser = NULL,
  ignore_infinite_cluster = TRUE,
  calculate_representatives = FALSE,
  num_samples = 30,
  alpha = 0.05,
  return_subsetted = FALSE,
  return_pvals = FALSE,
  return_diag = TRUE,
  num_workers = parallelly::availableCores(omit = 1),
  p_less_than_alpha = FALSE
)

Value

either a numeric vector of threshold values, with one for each dimension 0..`maxdim` (in that order), or a list containing those thresholds and any additional elements that were requested (for example through `return_diag`, `return_subsetted` or `return_pvals`).

Arguments

X

the input dataset, must either be a matrix or data frame.

FUN_diag

a string representing the persistent homology function to use for calculating the full persistence diagram, either 'calculate_homology' (the default), 'PyH' or 'ripsDiag'.

FUN_boot

a string representing the persistent homology function to use for calculating the bootstrapped persistence diagrams, either 'calculate_homology' (the default), 'PyH' or 'ripsDiag'.

maxdim

the integer maximum homological dimension for persistent homology, default 0.

thresh

the positive numeric maximum radius of the Vietoris-Rips filtration.

distance_mat

a boolean representing if `X` is a distance matrix (TRUE) or not (FALSE, default).

ripser

the imported ripser module when `FUN_diag` or `FUN_boot` is `PyH`.

ignore_infinite_cluster

a boolean indicating whether or not to ignore the infinitely lived cluster when `FUN_diag` or `FUN_boot` is `PyH`.

calculate_representatives

a boolean representing whether to calculate representative (co)cycles, default FALSE. Note that representatives cannot be calculated when using the 'calculate_homology' function.

num_samples

the positive integer number of bootstrap samples, default 30.

alpha

the type-1 error threshold, default 0.05.

return_subsetted

a boolean representing whether or not to return the subsetted persistence diagram (with or without representatives), default FALSE.

return_pvals

a boolean representing whether or not to return p-values for features in the subsetted diagram, default FALSE.

return_diag

a boolean representing whether or not to return the calculated persistence diagram, default TRUE.

num_workers

the integer number of cores used for parallelizing (over bootstrap samples), default one less than the number of cores available on the machine.

p_less_than_alpha

a boolean representing whether or not to subset further and return only the features whose p-values are strictly less than `alpha`, default `FALSE`. Note that this is not part of the original bootstrap procedure.

Author

Shael Brown - shaelebrown@gmail.com

Details

The thresholds are determined by calculating the (1 - `alpha`) quantile of the bottleneck distance values between the real persistence diagram and other diagrams obtained by bootstrap resampling the data.

Two homology engines are input to this function: `FUN_diag` calculates the actual persistence diagram (possibly with representative (co)cycles), and `FUN_boot` calculates the bootstrap diagrams. The reason is that `ripsDiag` is the slowest homology engine but is the only engine which calculates representative cycles (as opposed to co-cycles with `PyH`), so `FUN_boot` should be a faster engine, like `calculate_homology` or `PyH`.

p-values can be calculated for any feature which survives the thresholding if both `return_subsetted` and `return_pvals` are `TRUE`; however, these values may be larger than the original `alpha` value in some cases. Note that this is not part of the original bootstrap procedure. If stricter thresholding is desired, or the p-values must be less than `alpha`, set `p_less_than_alpha` to `TRUE`. The minimum possible p-value is always 1/(`num_samples` + 1).

Note that since `calculate_homology` can ignore the longest-lived cluster, fewer "real" clusters may be found. To avoid this possibility, try setting `FUN_diag` equal to 'ripsDiag'.

Please note that because the TDA package is no longer available on CRAN, if `FUN_diag` or `FUN_boot` is 'ripsDiag' then `bootstrap_persistence_thresholds` will look for the `ripsDiag` function in the global environment, so the TDA package should be attached with `library("TDA")` prior to use.
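The quantile-based thresholding described above can be illustrated with a small base R sketch. It is not the package's implementation: the bootstrap distances and the diagram are made-up numbers, and both the "persistence greater than twice the threshold" rule and the p-value formula are assumptions chosen to match the description (a confidence box that must not touch the diagonal, and a minimum p-value of 1/(`num_samples` + 1)).

```r
# Made-up bottleneck distances between the real diagram and each of
# num_samples bootstrap diagrams (a single homological dimension).
num_samples <- 30
alpha <- 0.05
boot_distances <- (1:num_samples) / 100

# Threshold: the (1 - alpha) quantile of the bootstrap distances.
threshold <- as.numeric(stats::quantile(boot_distances, probs = 1 - alpha))

# Hypothetical diagram with three features in that dimension.
diagram <- data.frame(birth = c(0.10, 0.20, 0.30),
                      death = c(1.50, 0.25, 0.55))
persistence <- diagram$death - diagram$birth

# ASSUMPTION: a feature is "significant" when its confidence box does not
# touch the diagonal, taken here as persistence > 2 * threshold.
significant <- persistence > 2 * threshold

# ASSUMPTION: p-value = (number of bootstrap distances at least half the
# persistence + 1) / (num_samples + 1), so it is never below 1/(num_samples + 1).
pvals <- vapply(
  persistence,
  function(p) (sum(boot_distances >= p / 2) + 1) / (num_samples + 1),
  numeric(1)
)

print(data.frame(diagram, persistence, significant, pvals))
```

In this toy setting only the first, most persistent feature survives, and its p-value equals the minimum 1/(`num_samples` + 1) because no bootstrap distance reaches half of its persistence.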

References

Chazal F et al (2017). "Robust Topological Inference: Distance to a Measure and Kernel Distance." https://www.jmlr.org/papers/volume18/15-484/15-484.pdf.

Examples


if(require("TDAstats"))
{
  # sample 50 points from the unit circle data set
  df <- TDAstats::circle2d[sample(1:100, size = 50), ]

  # calculate persistence thresholds for alpha = 0.05 (the default);
  # the calculated diagram is returned as well since return_diag
  # defaults to TRUE
  bootstrapped_diagram <- bootstrap_persistence_thresholds(X = df,
    maxdim = 1, thresh = 2, num_workers = 2)
}
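As a further sketch, the call below also requests the subsetted diagram and p-values by setting `return_subsetted` and `return_pvals` to `TRUE`. It assumes that both TDApplied and TDAstats are installed. Because this page does not list the names of the returned list elements, the sketch inspects the result with `names()` and `str()` rather than assuming them.

```r
if (require("TDApplied") && require("TDAstats")) {
  # sample 50 points from the unit circle data set
  df <- TDAstats::circle2d[sample(1:100, size = 50), ]

  # request the subsetted diagram and p-values in addition to the thresholds
  res <- bootstrap_persistence_thresholds(
    X = df,
    maxdim = 1,
    thresh = 2,
    return_subsetted = TRUE,
    return_pvals = TRUE,
    num_workers = 2
  )

  # inspect which elements were returned instead of assuming their names
  print(names(res))
  str(res, max.level = 1)
}
```

Setting `p_less_than_alpha = TRUE` in the same call would, per the argument description, further restrict the returned features to those with p-values strictly below `alpha`.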
