nearest_datasets: Select nearest datasets given input `x`.

Description

If `x` is a data.frame, computes its dataset characteristics. If `x` is a character string naming a PMLB dataset, uses the precomputed dataset statistics/characteristics in `summary_stats`.

Usage

nearest_datasets(x, ...)
# S3 method for default
nearest_datasets(x, ...)
# S3 method for character
nearest_datasets(
  x,
  n_neighbors = 5,
  dimensions = c("n_instances", "n_features"),
  target_name = "target",
  ...
)
# S3 method for data.frame
nearest_datasets(
  x,
  y = NULL,
  n_neighbors = 5,
  dimensions = c("n_instances", "n_features"),
  task = c("classification", "regression"),
  target_name = "target",
  ...
)

Value

Character vector of the names of the datasets most similar to `x`, most similar dataset first.

Arguments

x: Character string of dataset name from PMLB, or data.frame of n_samples x n_features (or n_features + 1 with a target column).
...: Further arguments passed to each method.
n_neighbors: Integer. The number of dataset names to return as neighbors.
dimensions: Character vector specifying dataset characteristics to include in similarity calculation. Dimensions must correspond to numeric columns of [all_summary_stats.tsv](https://github.com/EpistasisLab/pmlb/blob/master/pmlb/all_summary_stats.tsv). If 'all', uses all numeric columns. The default shown in Usage is c("n_instances", "n_features").
target_name: Character string specifying column of target/dependent variable.
y: Vector of target column. Required when `x` does not contain the target column.
task: Character string, either "classification" or "regression", specifying the task type used when generating summary statistics.
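
To make the roles of `dimensions` and `n_neighbors` concrete, here is a minimal base-R sketch of one way a nearest-dataset ranking could be computed. It is an illustration only and is not the package's implementation: `toy_stats` (with made-up values) and the helper `nearest_sketch()` are hypothetical, and the choice of standardized Euclidean distance is an assumption.

```r
# Toy stand-in for precomputed summary statistics (values are made up).
toy_stats <- data.frame(
  dataset     = c("a", "b", "c", "d"),
  n_instances = c(100, 120, 5000, 90),
  n_features  = c(4, 5, 50, 4),
  stringsAsFactors = FALSE
)

# Hypothetical helper: rank datasets by Euclidean distance to a query,
# using only the columns named in `dimensions`.
nearest_sketch <- function(query, stats, dimensions, n_neighbors = 2) {
  ref <- as.matrix(stats[, dimensions, drop = FALSE])
  # Standardize each dimension so that no single scale dominates (assumption).
  centers <- colMeans(ref)
  scales <- apply(ref, 2, sd)
  ref_z <- sweep(sweep(ref, 2, centers, "-"), 2, scales, "/")
  query_z <- (unlist(query[dimensions]) - centers) / scales
  # Distance from every reference dataset to the query.
  dists <- sqrt(rowSums(sweep(ref_z, 2, query_z, "-")^2))
  # Closest dataset first, truncated to n_neighbors names.
  stats$dataset[order(dists)][seq_len(n_neighbors)]
}

query <- list(n_instances = 92, n_features = 4)
result <- nearest_sketch(query, toy_stats, c("n_instances", "n_features"))
print(result)
```

In this sketch, adding or dropping a name in `dimensions` changes which characteristics contribute to the distance, and `n_neighbors` only controls how many names are kept from the sorted result.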

Examples

if (interactive()){
  nearest_datasets('penguins')
  nearest_datasets(fetch_data('penguins'))
}