tsne_grid: t-SNE grid search function

Description

This function grid searches a range of perplexity values, trying several seeds for each value. Verbose output is always on and cannot be turned off through an argument. If you need the function without it, rebuild the package after removing the verbose messages.

Usage

tsne_grid(data, output_dims, input_dims = ncol(data),
  perplexity_range = c(1, min(floor((nrow(data) - 1)/3), 1000)), tries = 10,
  iterations = 10000, theta = 0, check_duplicates = FALSE, pca = FALSE,
  is_distance = FALSE)

Arguments

data

The data.frame input into t-SNE

output_dims

How many dimensions to output? (computation time increases exponentially with this value)

input_dims

How many input dimensions to use? (defaults to ncol(data)) - when using pca, set this to a value below the default

perplexity_range

What perplexity interval to search? (should be formatted as c(min, max)) - defaults to c(1, min(floor((nrow(data) - 1)/3), 1000)), so the upper bound is capped at 1000 - to grid search seeds for a fixed perplexity value, use min = max - the best pragmatic perplexity for the lowest loss is typically floor((nrow(data) - 1)/3). Avoid very high perplexity (like 1000) on large data (like 10000 observations): tree creation may appear never to finish, because it scales quadratically (or even worse).

tries

How many seeds to test per perplexity value? (computation time increases linearly with this value)

iterations

How many iterations to perform per t-SNE run? (computation time increases approximately linearly with this value)

theta

Use exact t-SNE (0) or Barnes-Hut t-SNE? (a value in the (0, 1] interval selects Barnes-Hut)

check_duplicates

Should t-SNE check for duplicates? (contrary to common belief, t-SNE works perfectly well with identical observations)

pca

Should a PCA (Principal Component Analysis) be performed? (note: it is performed every iteration, so it is computationally intensive and should be avoided - if you need PCA, compute it once beforehand and pass its output as data instead of the raw data)

is_distance

Is the input a distance matrix? (assumes a square data.frame that is symmetric about its diagonal)
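The argument descriptions above mention three inputs that are usually prepared before the call: a capped perplexity range, a precomputed PCA, and a distance matrix. The following is a minimal sketch of that preparation in base R, assuming the iris data set as a stand-in for your own data. max_perplexity is a hypothetical helper, not part of the package, and keeping two principal components is an arbitrary choice for illustration.

```r
# Hypothetical helper (not part of the package): upper perplexity bound
# as described for perplexity_range, capped at 1000.
max_perplexity <- function(n_obs, cap = 1000) {
  min(floor((n_obs - 1) / 3), cap)
}

x <- as.matrix(iris[, 1:4])  # example data: 150 observations, 4 numeric columns

# (min, max) interval to pass as perplexity_range
perplexity_range <- c(1, max_perplexity(nrow(x)))

# Fixed perplexity (search over seeds only): min = max
fixed_range <- rep(max_perplexity(nrow(x)), 2)

# PCA computed once up front; the scores would be passed as data with pca = FALSE
pca_scores <- as.data.frame(prcomp(x, center = TRUE, scale. = TRUE)$x[, 1:2])

# Full symmetric distance matrix, for use with is_distance = TRUE
dist_matrix <- as.data.frame(as.matrix(dist(x)))
```

Either pca_scores or dist_matrix would then take the place of data in the call, with is_distance = TRUE in the second case.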

Value

A list with the elements returned by Rtsne for the best t-SNE run (lowest loss at a specific iteration)
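To make the selection rule concrete, here is a rough sketch of a grid search over perplexity values and seeds that keeps the run with the lowest loss. It is not the package's implementation: grid_search_sketch and fake_fit are hypothetical names, the default fit assumes the Rtsne package is installed, and comparing the last entry of itercosts is an assumption about which iteration's loss is used.

```r
# Sketch only: `fit` stands in for a t-SNE call. Its default assumes the
# Rtsne package is installed; it is evaluated only if no `fit` is supplied.
grid_search_sketch <- function(data, perplexities, tries,
                               fit = function(data, perplexity) {
                                 Rtsne::Rtsne(data, perplexity = perplexity)
                               }) {
  best <- NULL
  best_loss <- Inf
  for (perplexity in perplexities) {
    for (seed in seq_len(tries)) {
      set.seed(seed)                     # one seed per try
      model <- fit(data, perplexity)
      loss <- tail(model$itercosts, 1)   # assumption: compare the last recorded loss
      if (loss < best_loss) {
        best_loss <- loss
        best <- c(model, list(perplexity = perplexity, seed = seed))
      }
    }
  }
  best
}

# Stand-in fit so the sketch runs without Rtsne: the fake loss is
# smallest when the perplexity equals 5.
fake_fit <- function(data, perplexity) {
  list(itercosts = c(10, abs(perplexity - 5) + runif(1)))
}

best <- grid_search_sketch(iris[, 1:4], perplexities = 3:7, tries = 3,
                           fit = fake_fit)
```

Passing the fitting function as an argument keeps the search loop separate from the t-SNE call, so the loop can be checked with a cheap stand-in before spending time on real runs.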

Examples


#tsne_model <- tsne_grid(data = initial_diag, output_dims = 3,
#perplexity_range = c(floor((nrow(initial_diag)-1)/3), floor((nrow(initial_diag)-1)/3)),
#tries = 100, iterations = 10000, theta = 0.0, check_duplicates = FALSE,
#pca = FALSE, is_distance = TRUE)
