diagram_ksvm: Fit a support vector machine model where each training set instance is a persistence diagram.

Description

Returns the output of kernlab's ksvm function on the Gram matrix of the list of persistence diagrams in a particular dimension.

Usage

diagram_ksvm(
  diagrams,
  cv = 1,
  dim,
  t = 1,
  sigma = 1,
  rho = NULL,
  y,
  type = NULL,
  distance_matrices = NULL,
  C = 1,
  nu = 0.2,
  epsilon = 0.1,
  prob.model = FALSE,
  class.weights = NULL,
  fit = TRUE,
  cache = 40,
  tol = 0.001,
  shrinking = TRUE,
  num_workers = parallelly::availableCores(omit = 1)
)

Value

a list of class 'diagram_ksvm' containing the elements

cv_results

the cross-validation results - a matrix storing the parameters for each model in the tuning grid and its mean cross-validation error over all splits.

best_model

a list containing the output of ksvm run on the whole dataset with the optimal model parameters found during cross-validation, as well as the optimal kernel parameters for the model.

diagrams

the diagrams which were supplied in the function call.

Arguments

diagrams: a list of persistence diagrams which are either the output of a persistent homology calculation like ripsDiag/calculate_homology/PyH, or diagram_to_df.
cv: a positive number at most the length of `diagrams` which determines the number of cross validation splits to be performed (default 1, aka no cross-validation). If `prob.model` is TRUE then cv is set to 1 since kernlab performs 3-fold CV internally in this case. When performing classification, classes are balanced within each cv fold.
dim: a non-negative integer vector of homological dimensions in which the model is to be fit.
t: either a vector of positive numbers representing the grid of values for the scale of the persistence Fisher kernel or NULL, default 1. If NULL then t is selected automatically, see details.
sigma: a vector of positive numbers representing the grid of values for the bandwidth of the Fisher information metric, default 1.
rho: an optional positive number representing the heuristic for Fisher information metric approximation, see diagram_distance. Default NULL. If supplied, distance matrix calculations are sequential.
y: a response vector with one label for each persistence diagram. Must be either numeric or factor, but doesn't need to be supplied when `type` is "one-svc".
type: a string representing the type of task to be performed. Can be any one of "C-svc","nu-svc","one-svc","eps-svr","nu-svr" - default for regression is "eps-svr" and for classification is "C-svc". See ksvm for details.
distance_matrices: an optional list of precomputed Fisher distance matrices, corresponding to the rows in `expand.grid(dim = dim,sigma = sigma)`, default NULL.
C: a number representing the cost of constraints violation (default 1). This is the 'C'-constant of the regularization term in the Lagrange formulation.
nu: numeric parameter needed for nu-svc, one-svc and nu-svr. The `nu` parameter sets the upper bound on the training error and the lower bound on the fraction of data points that become support vectors (default 0.2).
epsilon: epsilon in the insensitive-loss function used for eps-svr, nu-svr and eps-bsvm (default 0.1).
prob.model: if set to TRUE builds a model for calculating class probabilities or in case of regression, calculates the scaling parameter of the Laplacian distribution fitted on the residuals. Fitting is done on output data created by performing a 3-fold cross-validation on the training data. For details see references (default FALSE).
class.weights: a named vector of weights for the different classes, used for asymmetric class sizes. Not all factor levels have to be supplied (default weight: 1). All components have to be named.
fit: indicates whether the fitted values should be computed and included in the model or not (default TRUE).
cache: cache memory in MB (default 40).
tol: tolerance of termination criteria (default 0.001).
shrinking: option whether to use the shrinking-heuristics (default TRUE).
num_workers: the number of cores used for parallel computation, default is one less than the number of cores on the machine.
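
The ordering expected for `distance_matrices` can be checked in base R. The following is a minimal sketch, assuming hypothetical values for `dim` and `sigma` and using placeholder zero matrices in place of real Fisher distance matrices; it only illustrates how `expand.grid` orders its rows.

```r
# Hypothetical parameter values, chosen only for illustration
dim <- c(0, 1)
sigma <- c(0.1, 1)

# expand.grid varies its first argument fastest, so the rows are ordered
# (dim = 0, sigma = 0.1), (dim = 1, sigma = 0.1), (dim = 0, sigma = 1), ...
grid <- expand.grid(dim = dim, sigma = sigma)
print(grid)

# A precomputed list would need one n x n matrix per row of `grid`, in the
# same order. Placeholder zero matrices stand in for real distances here.
n <- 4
distance_matrices <- lapply(seq_len(nrow(grid)), function(i) matrix(0, n, n))
length(distance_matrices) == nrow(grid)
```

Building the list by iterating over the rows of the grid, rather than with nested loops written by hand, is one way to keep the matrices aligned with the parameter combinations.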

Author

Shael Brown - shaelebrown@gmail.com

Details

Cross validation is carried out in parallel, using a trick noted in doi:10.1007/s41468-017-0008-7. Since the persistence Fisher kernel can be written as \(k_{PF}(D_1,D_2)=exp(-t*d_{FIM}(D_1,D_2))=exp(-d_{FIM}(D_1,D_2))^t\), the Fisher information metric distance matrix can be stored for each sigma value in the parameter grid, which avoids recomputing distances for each value of `t`.

Note that the response parameter `y` must be a factor for classification. A character vector, for instance, will throw an error.

If `t` is NULL then 1/`t` is selected as the 1, 2, 5, 10, 20 and 50 percentiles of the upper triangle of the distance matrix of the training sample (per fold in the case of cross-validation). This is the process suggested in the persistence Fisher kernel paper. If any of these values would cause a division by 0 (e.g. if the training set is small) then the minimum non-zero element is taken as the denominator, and hence the returned parameters may have duplicate rows except for differing error values. If cross-validation is performed then the mean error across folds is still recorded, but the best `t` parameter across all folds is recorded in the cv results table.
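
The distance-reuse trick and the automatic choice of `t` can be sketched in base R. This is an illustration only, not the package's internal code: it assumes the kernel has the form exp(-t * d_FIM) as in the persistence Fisher kernel paper, and it uses a toy Euclidean distance matrix (the hypothetical `d_fim`) in place of real Fisher information metric distances.

```r
set.seed(1)

# Toy symmetric distance matrix standing in for the Fisher information
# metric distances between 5 diagrams (hypothetical values)
pts <- matrix(runif(10), ncol = 2)
d_fim <- as.matrix(dist(pts))

# Candidate values of 1/t: the 1, 2, 5, 10, 20 and 50 percentiles
# of the upper triangle of the distance matrix
upper <- d_fim[upper.tri(d_fim)]
inv_t <- quantile(upper, probs = c(0.01, 0.02, 0.05, 0.1, 0.2, 0.5),
                  names = FALSE)

# Guard against division by zero by falling back to the
# smallest non-zero distance
inv_t[inv_t == 0] <- min(upper[upper > 0])
t_grid <- 1 / inv_t

# The elementwise exponential is computed once and then reused for every t,
# since exp(-t * d) equals exp(-d)^t
base_kernel <- exp(-d_fim)
gram_matrices <- lapply(t_grid, function(t) base_kernel^t)
```

Each element of `gram_matrices` is then a candidate Gram matrix for one value of `t`, obtained without recomputing any distances.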

References

Murphy, K. "Machine learning: a probabilistic perspective." MIT press (2012).

Examples

if(require("TDAstats"))
{
  # create four diagrams
  D1 <- TDAstats::calculate_homology(TDAstats::circle2d[sample(1:100,20),],
                      dim = 1,threshold = 2)
  D2 <- TDAstats::calculate_homology(TDAstats::circle2d[sample(1:100,20),],
                      dim = 1,threshold = 2)
  D3 <- TDAstats::calculate_homology(TDAstats::sphere3d[sample(1:100,20),],
                      dim = 1,threshold = 2)
  D4 <- TDAstats::calculate_homology(TDAstats::sphere3d[sample(1:100,20),],
                      dim = 1,threshold = 2)
  g <- list(D1,D2,D3,D4)

  # create response vector
  y <- as.factor(c("circle","circle","sphere","sphere"))

  # fit model without cross validation
  model_svm <- diagram_ksvm(diagrams = g,cv = 1,dim = c(0),
                            y = y,sigma = c(1),t = c(1),
                            num_workers = 2)
}