mlr3measures (version 1.0.0)

mcc: Matthews Correlation Coefficient

Description

Measure to compare true observed labels with predicted labels in multiclass classification tasks.

Usage

mcc(truth, response, positive = NULL, ...)

Value

Performance value as numeric(1).

Arguments

truth

(factor())
True (observed) labels. Must have the same levels and length as response.

response

(factor())
Predicted response labels. Must have the same levels and length as truth.

positive

(character(1))
Name of the positive class in case of binary classification.

...

(any)
Additional arguments. Currently ignored.

Meta Information

  • Type: "classif"

  • Range: \([-1, 1]\)

  • Minimize: FALSE

  • Required prediction: response

Details

In the binary case, the Matthews Correlation Coefficient is defined as $$ \frac{\mathrm{TP} \cdot \mathrm{TN} - \mathrm{FP} \cdot \mathrm{FN}}{\sqrt{(\mathrm{TP} + \mathrm{FP}) (\mathrm{TP} + \mathrm{FN}) (\mathrm{TN} + \mathrm{FP}) (\mathrm{TN} + \mathrm{FN})}}, $$ where \(\mathrm{TP}\), \(\mathrm{FP}\), \(\mathrm{TN}\), \(\mathrm{FN}\) are the number of true positives, false positives, true negatives, and false negatives, respectively.
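As a rough illustration of the binary formula, the following base-R sketch counts the four cells directly from two factors. `mcc_binary_sketch()` is a hypothetical helper written for this example and is not part of mlr3measures; it assumes the zero-denominator fallback described in this section.

```r
# Hypothetical helper (not part of mlr3measures): binary MCC from two factors.
mcc_binary_sketch = function(truth, response, positive) {
  tp = sum(truth == positive & response == positive)
  tn = sum(truth != positive & response != positive)
  fp = sum(truth != positive & response == positive)
  fn = sum(truth == positive & response != positive)

  # as.numeric() keeps the products in double precision
  # (avoids integer overflow for large counts)
  num = as.numeric(tp) * tn - as.numeric(fp) * fn
  den = sqrt(as.numeric(tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))

  # Assumption taken from the Details text: a zero denominator is replaced by 1
  if (den == 0) den = 1
  num / den
}

lvls = c("a", "b")
truth = factor(c("a", "a", "a", "b", "b", "b"), levels = lvls)
response = factor(c("a", "a", "b", "b", "b", "a"), levels = lvls)
mcc_binary_sketch(truth, response, positive = "a")
```

For this toy data TP = TN = 2 and FP = FN = 1, so the value is (2 * 2 - 1 * 1) / sqrt(3 * 3 * 3 * 3) = 1/3.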

In the multi-class case, the Matthews Correlation Coefficient is defined for a multi-class confusion matrix \(C\) with \(K\) classes: $$ \frac{c \cdot s - \sum_k^K p_k \cdot t_k}{\sqrt{(s^2 - \sum_k^K p_k^2) \cdot (s^2 - \sum_k^K t_k^2)}}, $$ where

  • \(s = \sum_i^K \sum_j^K C_{ij}\): total number of samples

  • \(c = \sum_k^K C_{kk}\): total number of correctly predicted samples

  • \(t_k = \sum_i^K C_{ik}\): number of predictions for each class \(k\)

  • \(p_k = \sum_j^K C_{kj}\): number of true occurrences for each class \(k\).
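The quantities above can be sketched in base R as follows. `mcc_multiclass_sketch()` is a hypothetical helper for illustration only, not the package's implementation. It assumes a confusion matrix with true labels in rows and predictions in columns; since the formula is symmetric in \(p_k\) and \(t_k\), the orientation does not change the result.

```r
# Hypothetical helper (not part of mlr3measures): multi-class MCC.
mcc_multiclass_sketch = function(truth, response) {
  # Assumed orientation: rows = truth, columns = response
  tab = table(truth, response)
  C = matrix(as.numeric(tab), nrow = nrow(tab))

  s = sum(C)            # total number of samples
  c_corr = sum(diag(C)) # correctly predicted samples
  p_k = rowSums(C)      # true occurrences per class
  t_k = colSums(C)      # predictions per class

  num = c_corr * s - sum(p_k * t_k)
  den = sqrt((s^2 - sum(p_k^2)) * (s^2 - sum(t_k^2)))

  # Assumption taken from the Details text: a zero denominator is replaced by 1
  if (den == 0) den = 1
  num / den
}

lvls = c("a", "b", "c")
truth = factor(c("a", "a", "b", "b", "c", "c"), levels = lvls)
response = factor(c("a", "b", "b", "b", "c", "a"), levels = lvls)
mcc_multiclass_sketch(truth, response)
```

Here \(s = 6\), \(c = 4\), \(p = (2, 2, 2)\) and \(t = (2, 3, 1)\), so the result is \(12 / \sqrt{24 \cdot 22}\), roughly 0.52.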

The above formula is undefined if any of the four sums in the denominator is 0 in the binary case, and more generally if either \(s^2 - \sum_k^K p_k^2\) or \(s^2 - \sum_k^K t_k^2\) is equal to 0. In these cases, the denominator is set to 1.
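A small base-R sketch of this degenerate case, using made-up data in which every prediction is the same class. The variable names are illustrative only. When all predictions fall into one class, the numerator is also 0, so replacing the denominator with 1 yields an MCC of 0.

```r
lvls = c("a", "b", "c")
truth = factor(c("a", "b", "c", "a", "b"), levels = lvls)
response = factor(rep("a", 5), levels = lvls)  # constant prediction

C = table(truth, response)  # assumed orientation: rows = truth
s = sum(C)
p_k = rowSums(C)            # true occurrences per class
t_k = colSums(C)            # predictions per class: (5, 0, 0)

num = sum(diag(C)) * s - sum(p_k * t_k)
den_sq = (s^2 - sum(p_k^2)) * (s^2 - sum(t_k^2))

# s^2 - sum(t_k^2) is 25 - 25 = 0, so the formula itself is undefined;
# following the text above, the denominator is set to 1.
den = if (den_sq == 0) 1 else sqrt(den_sq)
result = num / den
result
```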

When there are more than two classes, the MCC no longer spans the full range from -1 to +1. Instead, the minimum attainable value lies between -1 and 0, depending on the distribution of the true labels. The maximum value is always +1.
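To make the range statement concrete, the sketch below evaluates the multi-class formula on two hand-built confusion matrices in which every sample is misclassified. `mcc_from_matrix()` is a hypothetical helper for this example, not a function from mlr3measures.

```r
# Hypothetical helper: MCC computed directly from a square confusion matrix.
mcc_from_matrix = function(C) {
  s = sum(C)
  p_k = rowSums(C)
  t_k = colSums(C)
  den = sqrt((s^2 - sum(p_k^2)) * (s^2 - sum(t_k^2)))
  if (den == 0) den = 1  # fallback described in the Details text
  (sum(diag(C)) * s - sum(p_k * t_k)) / den
}

# Two balanced classes, every sample misclassified
C2 = matrix(c(0, 1,
              1, 0), nrow = 2)

# Three balanced classes, every sample misclassified (cyclic shift)
C3 = matrix(c(0, 0, 1,
              1, 0, 0,
              0, 1, 0), nrow = 3)

mcc_from_matrix(C2)  # -1
mcc_from_matrix(C3)  # -0.5
```

With two classes, total misclassification reaches -1, whereas this balanced three-class arrangement only reaches -0.5.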

References

https://en.wikipedia.org/wiki/Phi_coefficient

Matthews BW (1975). “Comparison of the predicted and observed secondary structure of T4 phage lysozyme.” Biochimica et Biophysica Acta (BBA) - Protein Structure, 405(2), 442-451. doi:10.1016/0005-2795(75)90109-9.

See Also

Other Classification Measures: acc(), bacc(), ce(), logloss(), mauc_aunu(), mbrier(), zero_one()

Examples

set.seed(1)
lvls = c("a", "b", "c")
truth = factor(sample(lvls, 10, replace = TRUE), levels = lvls)
response = factor(sample(lvls, 10, replace = TRUE), levels = lvls)
mcc(truth, response)
