Information content was originally developed for message communication, with the intuition that less likely events carry more information. For a single event $x$ with probability $p(x)$ it is defined as:

$$I(x) = -\log p(x)$$
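As a minimal sketch of the definition above (the function name is my own; using base-2 log so the units are bits):

```python
import math

def information(p):
    """Self-information (surprisal) of an event with probability p, in bits."""
    return -math.log2(p)

# A less likely event carries more information:
print(information(0.5))    # fair coin flip → 1.0 bit
print(information(0.125))  # 1-in-8 event   → 3.0 bits
```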
For a true label $y \in \{0, 1\}$ and a predicted probability $\hat{y}$, the loss is:

$$L = -\left[\, y \log \hat{y} + (1 - y) \log (1 - \hat{y}) \,\right]$$

where $\hat{y}$ is the model's predicted probability that the label is positive. This is also called log loss or binary cross-entropy
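A minimal sketch of binary cross-entropy for a single example (the function name and the `eps` clipping are my own additions, to guard against `log(0)`):

```python
import math

def binary_cross_entropy(y, y_hat, eps=1e-12):
    """Log loss for one example: true label y in {0, 1}, predicted probability y_hat."""
    y_hat = min(max(y_hat, eps), 1 - eps)  # clip to avoid log(0)
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))

# A confident correct prediction gives a small loss; a confident wrong one, a large loss.
print(binary_cross_entropy(1, 0.9))  # ≈ 0.105
print(binary_cross_entropy(1, 0.1))  # ≈ 2.303
```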
True positive: predicted positive, label was positive ($\hat{y} = 1$, $y = 1$)
True negative: predicted negative, label was negative ($\hat{y} = 0$, $y = 0$)
False positive: predicted positive, label was negative ($\hat{y} = 1$, $y = 0$) (type I error)
False negative: predicted negative, label was positive ($\hat{y} = 0$, $y = 1$) (type II error)
Accuracy is the fraction of correct predictions, given as:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

Precision: Out of all the positive predictions, how many were correct? $TP / (TP + FP)$
Recall: Out of all the positive labels, how many were predicted positive? $TP / (TP + FN)$
Specificity: Out of all the negative labels, how many were predicted negative? $TN / (TN + FP)$
|  | Predicted Positive | Predicted Negative |
|---|---|---|
| Actual Positive | TP | FN |
| Actual Negative | FP | TN |
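The four metrics above can be computed directly from the confusion-matrix counts; a minimal sketch (function name and example counts are my own):

```python
def classification_metrics(tp, fp, fn, tn):
    """Standard metrics from confusion-matrix counts."""
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),  # all correct / all predictions
        "precision": tp / (tp + fp),                  # correct positives / predicted positives
        "recall": tp / (tp + fn),                     # correct positives / actual positives
        "specificity": tn / (tn + fp),                # correct negatives / actual negatives
    }

m = classification_metrics(tp=8, fp=2, fn=4, tn=6)
print(m)  # accuracy 0.7, precision 0.8, recall ≈ 0.667, specificity 0.75
```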
The receiver operating characteristic curve is a plot of the true positive rate (recall or sensitivity) vs. false positive rate (1 - specificity) as the detection threshold changes
The diagonal is the same as random guessing
A perfect classifier would hug the top left corner
Fun fact: the name comes from WWII radar operators, where true positives were airplanes and false positives were noise
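The curve can be traced by sweeping the detection threshold over the model's scores and recording (FPR, TPR) at each step; a minimal sketch (function name and toy data are my own):

```python
def roc_points(scores, labels):
    """(FPR, TPR) pairs as the decision threshold sweeps over the scores."""
    pos = sum(labels)
    neg = len(labels) - pos
    points = []
    for t in sorted(set(scores), reverse=True):  # high threshold -> low threshold
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        points.append((fp / neg, tp / pos))
    return points

# Lowering the threshold moves up and to the right along the curve.
print(roc_points([0.9, 0.8, 0.6, 0.3], [1, 1, 0, 1]))
```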

For binary classification, the sigmoid function converts a logit $z$ into a probability:

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$

For multiclass classification, the softmax function is used:

$$\text{softmax}(z)_i = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}$$

where $z$ is the vector of logits and $K$ is the number of classes. This means that the outputs all lie in $(0, 1)$ and sum to 1, so they can be interpreted as a probability distribution over the classes
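A minimal softmax sketch (function name is my own; subtracting the max logit is a standard trick to avoid overflow in `exp` and does not change the result):

```python
import math

def softmax(z):
    """Softmax over a list of logits, returning a probability distribution."""
    m = max(z)                                # shift for numerical stability
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
print(probs)       # each entry in (0, 1), larger logit -> larger probability
print(sum(probs))  # sums to 1 (up to float rounding)
```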