
qwraps2 (version 0.6.1)

confusion_matrix: Confusion Matrices (Contingency Tables)

Description

Construct confusion matrices and report accuracy, sensitivity, specificity, and confidence intervals (Wilson's method and, optionally, bootstrapping).

Usage

confusion_matrix(
  ...,
  thresholds = NULL,
  confint_method = "logit",
  alpha = getOption("qwraps2_alpha", 0.05)
)

# S3 method for default
confusion_matrix(
  truth,
  predicted,
  ...,
  thresholds = NULL,
  confint_method = "logit",
  alpha = getOption("qwraps2_alpha", 0.05)
)

# S3 method for formula
confusion_matrix(
  formula,
  data = parent.frame(),
  ...,
  thresholds = NULL,
  confint_method = "logit",
  alpha = getOption("qwraps2_alpha", 0.05)
)

# S3 method for glm
confusion_matrix(
  x,
  ...,
  thresholds = NULL,
  confint_method = "logit",
  alpha = getOption("qwraps2_alpha", 0.05)
)

# S3 method for qwraps2_confusion_matrix
print(x, ...)

Value

confusion_matrix returns a list with elements

  • cm_stats a data.frame with columns:

  • auroc numeric value for the area under the receiver operating characteristic (ROC) curve

  • auroc_ci a numeric vector of length two with the lower and upper bounds for a 100(1-alpha)% confidence interval about the auroc

  • auprc numeric value for the area under the precision-recall curve

  • auprc_ci a numeric vector of length two with the lower and upper limits for a 100(1-alpha)% confidence interval about the auprc

  • confint_method a character string reporting the method used to build the auroc_ci and auprc_ci

  • alpha the alpha level of the confidence intervals

  • prevalence the proportion of positive cases in the input, that is (TP + FN) / (TP + FN + FP + TN) = P / (P + N)
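As a rough illustration of two of these quantities, the base R sketch below computes the prevalence and a rank-based (Mann-Whitney) estimate of the area under the ROC curve for the data used in Example 1. The helper `auroc_rank` is a hypothetical name introduced here; the sketch does not claim to reproduce how qwraps2 computes `auroc` internally.

```r
# Data from Example 1
truth <- c(1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0)
pred  <- c(1, 1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0)

# Prevalence: P / (P + N), the share of condition-positive cases
prevalence <- mean(truth == 1)

# Hypothetical helper: rank-based AUROC, i.e. the probability that a randomly
# chosen positive case scores higher than a randomly chosen negative case,
# with ties counted as one half (rank() averages tied ranks by default).
auroc_rank <- function(truth, pred) {
  r     <- rank(pred)
  n_pos <- sum(truth == 1)
  n_neg <- sum(truth == 0)
  (sum(r[truth == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

auroc <- auroc_rank(truth, pred)
```

Here `prevalence` is 8 / 12 and `auroc` is 0.75.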

Arguments

...

pass through

thresholds

a numeric vector of thresholds to be used to define the confusion matrix (one threshold) or matrices (two or more thresholds). If NULL the unique values of predicted will be used.

confint_method

character string denoting whether the logit (default), binomial, or Wilson score method is used to derive confidence intervals

alpha

alpha level for 100 * (1 - alpha)% confidence intervals

truth

an integer vector with the values 0 and 1, or a logical vector. A value of 0 or FALSE indicates condition negative; 1 or TRUE indicates condition positive.

predicted

a numeric vector. See Details.

formula

column (known) ~ row (test) for building the confusion matrix

data

environment containing the variables listed in the formula

x

a glm object
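To make the roles of alpha and the Wilson score option more concrete, here is a minimal base R sketch of a Wilson score interval for a single proportion. The function name `wilson_ci` is hypothetical, and the sketch only illustrates the textbook formula; it is not taken from the qwraps2 source. The inputs (6 successes out of 8) correspond to the sensitivity in Example 1 at a threshold of 1.

```r
# Hypothetical helper: Wilson score interval for x successes in n trials,
# giving a 100 * (1 - alpha)% confidence interval.
wilson_ci <- function(x, n, alpha = 0.05) {
  p      <- x / n
  z      <- qnorm(1 - alpha / 2)             # standard normal quantile
  denom  <- 1 + z^2 / n
  centre <- (p + z^2 / (2 * n)) / denom      # shrinks p towards 0.5
  half   <- z * sqrt(p * (1 - p) / n + z^2 / (4 * n^2)) / denom
  c(lower = centre - half, upper = centre + half)
}

# 6 true positives among 8 condition-positive cases
sens_ci <- wilson_ci(x = 6, n = 8)
```

The result should agree with `prop.test(6, 8, correct = FALSE)$conf.int`, which uses the same score interval without continuity correction.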

Details

The confusion matrix:

                          True Condition
                            +      -
Predicted Condition   +     TP     FP
Predicted Condition   -     FN     TN

where

  • FN: False Negative, truth = 1 and prediction < threshold,

  • FP: False Positive, truth = 0 and prediction >= threshold,

  • TN: True Negative, truth = 0 and prediction < threshold, and

  • TP: True Positive, truth = 1 and prediction >= threshold.
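The four definitions above can be sketched in base R as follows. The helper `cm_counts` is a hypothetical name used only for illustration, and the data are those of Example 1. The last lines sweep over the unique predicted values, mirroring the documented behaviour when thresholds is NULL, without claiming to match the package's internal implementation.

```r
# Data from Example 1
truth <- c(1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0)
pred  <- c(1, 1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0)

# Hypothetical helper: cell counts of the confusion matrix at one threshold.
# A case is predicted positive when prediction >= threshold.
cm_counts <- function(truth, pred, threshold) {
  positive <- pred >= threshold
  c(
    TP = sum(truth == 1 &  positive),
    FP = sum(truth == 0 &  positive),
    FN = sum(truth == 1 & !positive),
    TN = sum(truth == 0 & !positive)
  )
}

counts_at_1 <- cm_counts(truth, pred, threshold = 1)

# One row of counts per unique predicted value
thresholds <- sort(unique(pred))
counts_by_threshold <- t(sapply(thresholds, function(th) cm_counts(truth, pred, th)))
rownames(counts_by_threshold) <- thresholds
```

At a threshold of 1 this gives TP = 6, FP = 1, FN = 2, and TN = 3; at a threshold of 0 every case is predicted positive, so FN and TN are both 0.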

The statistics returned in the cm_stats element are:

  • accuracy = (TP + TN) / (TP + TN + FP + FN)

  • sensitivity, aka true positive rate or recall = TP / (TP + FN)

  • specificity, aka true negative rate = TN / (TN + FP)

  • positive predictive value (PPV), aka precision = TP / (TP + FP)

  • negative predictive value (NPV) = TN / (TN + FN)

  • false negative rate (FNR) = 1 - Sensitivity

  • false positive rate (FPR) = 1 - Specificity

  • false discovery rate (FDR) = 1 - PPV

  • false omission rate (FOR) = 1 - NPV

  • F1 score = 2 * (PPV * sensitivity) / (PPV + sensitivity) = 2 * TP / (2 * TP + FP + FN)

  • Matthews Correlation Coefficient (MCC) = ((TP * TN) - (FP * FN)) / sqrt((TP + FP) * (TP + FN) * (TN + FP) * (TN + FN))
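As a worked check of these formulas, the sketch below evaluates them in base R for the counts obtained from Example 1 at a threshold of 1 (TP = 6, FP = 1, FN = 2, TN = 3). The function `cm_stats_sketch` is a hypothetical helper, the F1 line uses the standard harmonic-mean definition, and no handling of zero denominators is included.

```r
# Hypothetical helper: summary statistics from the four cell counts.
# Assumes all denominators are non-zero.
cm_stats_sketch <- function(TP, FP, FN, TN) {
  sensitivity <- TP / (TP + FN)
  specificity <- TN / (TN + FP)
  ppv         <- TP / (TP + FP)
  npv         <- TN / (TN + FN)
  c(
    accuracy    = (TP + TN) / (TP + TN + FP + FN),
    sensitivity = sensitivity,
    specificity = specificity,
    ppv         = ppv,
    npv         = npv,
    fnr         = 1 - sensitivity,
    fpr         = 1 - specificity,
    fdr         = 1 - ppv,
    fomr        = 1 - npv,   # false omission rate; "for" is reserved in R
    f1          = 2 * ppv * sensitivity / (ppv + sensitivity),
    mcc         = (TP * TN - FP * FN) /
                  sqrt((TP + FP) * (TP + FN) * (TN + FP) * (TN + FN))
  )
}

stats <- cm_stats_sketch(TP = 6, FP = 1, FN = 2, TN = 3)
round(stats, 4)
```

For these counts, accuracy, sensitivity, and specificity are all 0.75, PPV is 6 / 7, NPV is 0.6, and F1 is 0.8.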

Synonyms for the statistics:

  • Sensitivity: true positive rate (TPR), recall, hit rate

  • Specificity: true negative rate (TNR), selectivity

  • PPV: precision

  • FNR: miss rate

Sensitivity and PPV can be indeterminate because of division by zero. To address this, the following rule from the DICE group (https://github.com/dice-group/gerbil/wiki/Precision,-Recall-and-F1-measure) is used: if TP, FP, and FN are all 0, then PPV, sensitivity, and F1 are defined to be 1. If TP is 0 and FP + FN > 0, then PPV, sensitivity, and F1 are all defined to be 0.
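The rule can be written as a small base R function. This is only a sketch of the rule as stated above; `safe_ppv_sens_f1` is a hypothetical name and not part of the qwraps2 API.

```r
# Hypothetical helper: PPV, sensitivity, and F1 with the zero-count rule applied.
safe_ppv_sens_f1 <- function(TP, FP, FN) {
  if (TP == 0 && FP == 0 && FN == 0) {
    # Nothing to find and nothing predicted positive: defined to be 1
    return(c(ppv = 1, sensitivity = 1, f1 = 1))
  }
  if (TP == 0) {
    # TP is 0 and FP + FN > 0: defined to be 0
    return(c(ppv = 0, sensitivity = 0, f1 = 0))
  }
  ppv         <- TP / (TP + FP)
  sensitivity <- TP / (TP + FN)
  c(ppv = ppv, sensitivity = sensitivity,
    f1 = 2 * ppv * sensitivity / (ppv + sensitivity))
}

safe_ppv_sens_f1(TP = 0, FP = 0, FN = 0)  # all defined to be 1
safe_ppv_sens_f1(TP = 0, FP = 3, FN = 0)  # all defined to be 0
safe_ppv_sens_f1(TP = 6, FP = 1, FN = 2)  # ordinary case
```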

Examples


# Example 1: known truth and prediction status
df <-
  data.frame(
      truth = c(1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0)
    , pred  = c(1, 1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0)
  )

confusion_matrix(df$truth, df$pred, thresholds = 1)

# Example 2: Use with a logistic regression model
mod <- glm(
  formula = spam ~ word_freq_our + word_freq_over + capital_run_length_total
, data = spambase
, family = binomial()
)

confusion_matrix(mod)
confusion_matrix(mod, thresholds = 0.5)
