
cvms (version 1.7.0)

baseline_multinomial: Create baseline evaluations



Create a baseline evaluation of a test set.

In modelling, a baseline is a result against which it is meaningful to compare the results of our models. For instance, in classification, we usually want our results to be better than random guessing. E.g. if we have three classes, we can expect an accuracy of 33.33%, as for every observation we have a 1/3 chance of guessing the correct class. So our model should achieve a higher accuracy than 33.33% before it is more useful to us than guessing.

While this expected value is often fairly straightforward to find analytically, it only represents what we can expect on average. In reality, it's possible to get far better results than that by guessing. baseline_multinomial() finds the range of likely values by evaluating multiple sets of random predictions and summarizing them with a set of useful descriptors.
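The idea of a range of likely values can be illustrated with a small base R simulation. This is only a sketch under simplifying assumptions: it guesses classes uniformly with sample() rather than going through baseline_multinomial(), and the test set, the helper name random_accuracy, and the number of repetitions are made up for illustration.

```r
# Simulate random guessing on a three-class test set (base R only)
set.seed(1)

classes <- c("a", "b", "c")
targets <- rep(classes, each = 20)  # 60 observations, 20 per class
n_reps <- 100                       # number of sets of random predictions

# Accuracy of one set of uniformly random class guesses
# (hypothetical helper, not part of cvms)
random_accuracy <- function(targets, classes) {
  guesses <- sample(classes, size = length(targets), replace = TRUE)
  mean(guesses == targets)
}

accuracies <- replicate(n_reps, random_accuracy(targets, classes))

# Summarize the spread with a few descriptors
baseline_summary <- c(
  Mean = mean(accuracies),
  Median = median(accuracies),
  SD = sd(accuracies),
  Min = min(accuracies),
  Max = max(accuracies)
)
print(round(baseline_summary, 3))
```

The mean lands near 1/3, while the maximum shows how far above the expected value a lucky set of guesses can get.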

Technically, it creates one-vs-all (binomial) baseline evaluations for the `n` sets of random predictions and summarizes them. Additionally, sets of "all class x,y,z,..." predictions are evaluated.


baseline_multinomial(
  test_data,
  dependent_col,
  n = 100,
  metrics = list(),
  random_generator_fn = runif,
  parallel = FALSE
)


list containing:

  1. a tibble with summarized results (called summarized_metrics)

  2. a tibble with random evaluations (random_evaluations)

  3. a tibble with the summarized class level results (summarized_class_level_results)


Macro metrics

Based on the generated predictions, one-vs-all (binomial) evaluations are performed and aggregated to get the following macro metrics:

Balanced Accuracy, F1, Sensitivity, Specificity, Positive Predictive Value, Negative Predictive Value, Kappa, Detection Rate, Detection Prevalence, and Prevalence.

In general, the metrics mentioned in binomial_metrics() can be enabled as macro metrics (excluding MCC, AUC, Lower CI, Upper CI, and the AIC/AICc/BIC metrics). These metrics also have a weighted average version.

N.B. we also refer to the one-vs-all evaluations as the class level results.
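As a rough illustration of one-vs-all evaluation and macro averaging, the base R sketch below computes Sensitivity per class and then averages it. The data and the helper name one_vs_all_sensitivity are made up, and weighting the average by class support is an assumption about what "weighted average" means here, not a statement about the package internals.

```r
# Toy targets and predictions for three classes
targets     <- c("a", "a", "a", "a", "b", "b", "b", "c", "c", "c")
predictions <- c("a", "a", "b", "c", "b", "b", "a", "c", "a", "c")
classes <- sort(unique(targets))

# One-vs-all Sensitivity: `cl` is the positive class, all others are negative
# (hypothetical helper, not part of cvms)
one_vs_all_sensitivity <- function(cl) {
  true_pos  <- sum(targets == cl & predictions == cl)
  false_neg <- sum(targets == cl & predictions != cl)
  true_pos / (true_pos + false_neg)
}

# Class level results: one value per class
class_level <- vapply(classes, one_vs_all_sensitivity, numeric(1))

# Support: number of observations per class in the test set
support <- as.vector(table(targets)[classes])

# Macro metric: unweighted mean over classes
macro_sensitivity <- mean(class_level)

# Weighted version (assumed here to be weighted by support)
weighted_sensitivity <- weighted.mean(class_level, w = support)

print(class_level)
print(c(macro = macro_sensitivity, weighted = weighted_sensitivity))
```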

Multiclass metrics

In addition, the Overall Accuracy and multiclass MCC metrics are computed. Multiclass AUC can be enabled but is slow to calculate with many classes.


The Summarized Results tibble contains:

Summary of the random evaluations.

How: The one-vs-all binomial evaluations are aggregated by repetition and summarized. Besides the metrics from the binomial evaluations, it also includes Overall Accuracy and multiclass MCC.


The Measure column indicates the statistical descriptor used on the evaluations. The Mean, Median, SD, IQR, Max, Min, NAs, and INFs measures describe the Random Evaluations tibble, while the CL_Max, CL_Min, CL_NAs, and CL_INFs describe the Class Level results.

The rows where Measure == All_<<class name>> are the evaluations when all the observations are predicted to be in that class.
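To see why the "all one class" rows are useful reference points, the following base R sketch computes the overall accuracy obtained when every observation is predicted to be in the same class. The imbalanced toy targets are made up for illustration, and the code does not call cvms.

```r
# Imbalanced toy test set: 10, 6, and 4 observations per class
targets <- rep(c("class_1", "class_2", "class_3"), times = c(10, 6, 4))
classes <- unique(targets)

# Overall accuracy when all observations are predicted as class `cl`
all_class_accuracy <- vapply(
  classes,
  function(cl) mean(rep(cl, length(targets)) == targets),
  numeric(1)
)
names(all_class_accuracy) <- paste0("All_", classes)

print(all_class_accuracy)
```

In this sketch each accuracy equals the prevalence of the predicted class, so on imbalanced data, always predicting the majority class can beat the expected accuracy of uniform guessing.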


The Summarized Class Level Results tibble contains:

The (nested) summarized results for each class, with the same metrics and descriptors as the Summarized Results tibble. Use tidyr::unnest on the tibble to inspect the results.

How: The one-vs-all evaluations are summarized by class.

The rows where Measure == All_0 are the evaluations when none of the observations are predicted to be in that class, while the rows where Measure == All_1 are the evaluations when all of the observations are predicted to be in that class.


The Random Evaluations tibble contains:

The repetition results with the same metrics as the Summarized Results tibble.


How: The one-vs-all evaluations are aggregated by repetition. If a metric contains one or more NAs in the one-vs-all evaluations, it will lead to an NA result for that repetition.
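The NA behaviour can be sketched in base R. In this made-up example, one class is never predicted, so its one-vs-all Positive Predictive Value is undefined, and a plain mean over the classes becomes NA. The helper name one_vs_all_ppv and the choice to represent the undefined value as NA are assumptions for illustration only.

```r
# Toy data in which class "c" is never predicted
targets     <- c("a", "a", "b", "b", "c", "c")
predictions <- c("a", "b", "b", "b", "a", "b")
classes <- sort(unique(targets))

# One-vs-all Positive Predictive Value for class `cl`
# (hypothetical helper; undefined cases are returned as NA)
one_vs_all_ppv <- function(cl) {
  true_pos <- sum(predictions == cl & targets == cl)
  pred_pos <- sum(predictions == cl)
  if (pred_pos == 0) NA_real_ else true_pos / pred_pos
}

class_ppv <- vapply(classes, one_vs_all_ppv, numeric(1))

# Aggregating over classes with a plain mean propagates the NA
macro_ppv <- mean(class_ppv)

# For comparison only: dropping the NA would hide the undefined class
macro_ppv_ignoring_na <- mean(class_ppv, na.rm = TRUE)

print(class_ppv)
print(macro_ppv)
```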

Also includes:

A nested tibble with the one-vs-all binomial evaluations (Class Level Results), including nested Confusion Matrices and the Support column, which is a count of how many observations from the class are in the test set.

A nested tibble with the predictions and targets.

A list of ROC curve objects.

A nested tibble with the multiclass confusion matrix.

A nested Process information object with information about the evaluation.

Name of dependent variable.





dependent_col

Name of dependent variable in the supplied test and training sets.


n

The number of sets of random predictions to evaluate. (Default is 100)


metrics

list for enabling/disabling metrics.

E.g. list("F1" = FALSE) would remove F1 from the results, and list("Accuracy" = TRUE) would add the regular Accuracy metric to the results. Default values (TRUE/FALSE) will be used for the remaining available metrics.

You can enable/disable all metrics at once by including "all" = TRUE/FALSE in the list. This is done prior to enabling/disabling individual metrics, which is why, for instance, list("all" = FALSE, "Accuracy" = TRUE) would return only the Accuracy metric.

The list can be created with multinomial_metrics().

Also accepts the string "all".


random_generator_fn

Function for generating random numbers. The softmax function is applied to the generated numbers to transform them to probabilities.

The first argument must be the number of random numbers to generate, as no other arguments are supplied.

To test the effect of using different functions, see multiclass_probability_tibble().
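The effect of the generator function can be sketched in base R by applying a softmax to the generated numbers by hand. This is an illustration only: the softmax and random_probabilities helpers are written here and are not the package's own code, and rcertain is the custom generator from the examples on this page.

```r
set.seed(1)

# Row-wise softmax: turns arbitrary numbers into probabilities summing to 1
softmax <- function(x) {
  z <- exp(x - max(x))  # subtract the max for numerical stability
  z / sum(z)
}

# Custom generator with a wider value range than runif()
rcertain <- function(n) {
  (runif(n, min = 1, max = 100)^1.4) / 100
}

# Build an n_obs x n_classes probability matrix (hypothetical helper).
# The generator is called with the count as its only argument.
random_probabilities <- function(generator_fn, n_obs, n_classes) {
  raw <- matrix(generator_fn(n_obs * n_classes), nrow = n_obs)
  t(apply(raw, 1, softmax))
}

p_unif    <- random_probabilities(runif, n_obs = 1000, n_classes = 5)
p_certain <- random_probabilities(rcertain, n_obs = 1000, n_classes = 5)

# Average probability assigned to the most likely class
mean_max_unif    <- mean(apply(p_unif, 1, max))
mean_max_certain <- mean(apply(p_certain, 1, max))
print(c(runif = mean_max_unif, rcertain = mean_max_certain))
```

Because runif() only produces values between 0 and 1, the softmax output stays close to uniform, whereas the wider range of rcertain() yields a clearly dominant class per observation.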


parallel

Whether to run the `n` evaluations in parallel. (Logical)

Remember to register a parallel backend first. E.g. with doParallel::registerDoParallel.


Ludvig Renbo Olsen, r-pkgs@ludvigolsen.dk


Packages used:

Multiclass ROC curve and AUC: pROC::multiclass.roc

See Also

Other baseline functions: baseline(), baseline_binomial(), baseline_gaussian()


Examples
# \donttest{
# Attach packages
library(cvms) # baseline_multinomial(), participant.scores
library(groupdata2) # partition()
library(dplyr) # %>% arrange()

# Data is part of cvms
data <- participant.scores

# Set seed for reproducibility
set.seed(1)

# Partition data
partitions <- partition(data, p = 0.7, list_out = TRUE)
train_set <- partitions[[1]]
test_set <- partitions[[2]]

# Create baseline evaluations
# Note: usually n=100 is a good setting

# Create some data with multiple classes
multiclass_data <- tibble(
  "target" = rep(paste0("class_", 1:5), each = 10)
)

baseline_multinomial(
  test_data = multiclass_data,
  dependent_col = "target",
  n = 4
)

# Parallelize evaluations

# Attach doParallel and register four cores
# Uncomment:
# library(doParallel)
# registerDoParallel(4)

# Make sure to uncomment the parallel argument
(mb <- baseline_multinomial(
  test_data = multiclass_data,
  dependent_col = "target",
  n = 6
  #, parallel = TRUE  # Uncomment
))

# Inspect the summarized class level results
# for class_2
mb$summarized_class_level_results %>%
  dplyr::filter(Class == "class_2") %>%
  tidyr::unnest(cols = c(Results)) # assumes the nested column is named Results

# Multinomial with custom random generator function
# that creates very "certain" predictions
# (once softmax is applied)

rcertain <- function(n) {
  (runif(n, min = 1, max = 100)^1.4) / 100
}

# Make sure to uncomment the parallel argument
baseline_multinomial(
  test_data = multiclass_data,
  dependent_col = "target",
  n = 6,
  random_generator_fn = rcertain
  #, parallel = TRUE  # Uncomment
)
# }
