8.1. Evaluating Classifiers
We will be looking at a binary (two class) classifier for which the Bayes classification rule tells us:

\[\hat y = \begin{cases} 1 & \text{if } P(y=1\mid \v x) > P(y=0\mid \v x)\\ 0 & \text{otherwise}\end{cases}\]

or equivalently:

\[\hat y = \begin{cases} 1 & \text{if } P(y=1\mid \v x) > 1/2\\ 0 & \text{otherwise}\end{cases}\]
In our evaluation of the classifier it does not matter whether the a posteriori probability is calculated by estimating the class conditional probability (density) of the data or is estimated directly with a hypothesis function \(h_{\v\theta}\).
The important thing is that we can make this into a parameterized classifier by selecting a classification threshold (or decision threshold) \(t\) different from \(1/2\):

\[\hat y = \begin{cases} 1 & \text{if } P(y=1\mid \v x) \geq t\\ 0 & \text{otherwise}\end{cases}\]
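As a minimal sketch of such a thresholded classifier (assuming the estimated probabilities \(P(y=1\mid\v x)\) for a test set are available in an array `p`; the names and numbers below are only for illustration):

```python
import numpy as np

def predict(p, t=0.5):
    """Thresholded classifier: predict 1 where the estimated
    a posteriori probability P(y=1|x) is at least the threshold t."""
    return (np.asarray(p) >= t).astype(int)

# hypothetical probability estimates for a small test set
p = np.array([0.1, 0.4, 0.55, 0.8, 0.95])
print(predict(p))          # default threshold t = 1/2
print(predict(p, t=0.8))   # a stricter threshold
```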
The results of the classifier on a dataset can be summarized in a confusion matrix, where we call \(y=1\) POSITIVE and \(y=0\) NEGATIVE (often in classification the \(y\) value indicates POSITIVE or NEGATIVE, TRUE or FALSE, PASS or FAIL, etc.):
| \(\quad\) | \(y=0\) | \(y=1\) |
|---|---|---|
| \(\hat y = 0\) | TN | FN |
| \(\hat y = 1\) | FP | TP |
where
- TP: stands for True Positives, indicating the number of elements \(\v x\) in the test set that are correctly (TRUE) classified as 1 (POSITIVE), i.e. \(\hat y=1\) and \(y=1\).
- TN: stands for True Negatives, the number of elements that are correctly classified as NEGATIVE, i.e. \(\hat y=0\) and \(y=0\).
- FP: stands for False Positives, indicating the number of elements that are incorrectly classified as positive, i.e. \(\hat y=1\) and \(y=0\).
- FN: stands for False Negatives, the number of elements that are incorrectly classified as negative, i.e. \(\hat y=0\) and \(y=1\).
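A short sketch of how these counts could be computed, assuming the true labels and predictions are available as 0/1 arrays `y` and `y_hat` (names and values chosen here for illustration):

```python
import numpy as np

def confusion_counts(y, y_hat):
    """Return the four confusion matrix entries for binary labels 0/1."""
    y, y_hat = np.asarray(y), np.asarray(y_hat)
    TP = np.sum((y_hat == 1) & (y == 1))
    TN = np.sum((y_hat == 0) & (y == 0))
    FP = np.sum((y_hat == 1) & (y == 0))
    FN = np.sum((y_hat == 0) & (y == 1))
    return TP, TN, FP, FN

y     = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_hat = np.array([1, 0, 0, 1, 1, 0, 1, 0])
print(confusion_counts(y, y_hat))   # (3, 3, 1, 1)
```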
From the confusion matrix we can easily calculate the total accuracy as the fraction of correctly classified examples, i.e.

\[\text{accuracy} = \frac{TP + TN}{TP + TN + FP + FN}\]
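For instance, with the illustrative counts from the sketch above:

```python
def accuracy(TP, TN, FP, FN):
    """Fraction of correctly classified examples."""
    return (TP + TN) / (TP + TN + FP + FN)

print(accuracy(TP=3, TN=3, FP=1, FN=1))  # 0.75
```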
Achieving the maximum accuracy of a classifier is not always the goal. Consider a decision where a false negative error has more severe consequences than a false positive error. For instance when testing a patient for the corona virus it is better to tolerate some more false positives (thinking a patient might have covid-19 when they do not) and have fewer false negative errors. Note that a false negative means that a patient who has covid-19 is not treated as such. By selecting a threshold \(t\) we can balance FP and FN errors (possibly at the expense of some of the accuracy).
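A small illustration of this trade-off (the labels, probability estimates and thresholds below are made up for the example): lowering the threshold \(t\) trades false negatives for false positives.

```python
import numpy as np

def fp_fn(y, p, t):
    """Count false positives and false negatives at threshold t."""
    y_hat = (np.asarray(p) >= t).astype(int)
    FP = np.sum((y_hat == 1) & (y == 0))
    FN = np.sum((y_hat == 0) & (y == 1))
    return FP, FN

# hypothetical true labels and estimated probabilities P(y=1|x)
y = np.array([0, 0, 0, 1, 1, 1, 1, 0, 1, 0])
p = np.array([0.1, 0.3, 0.45, 0.35, 0.6, 0.7, 0.55, 0.65, 0.9, 0.2])

for t in (0.3, 0.5, 0.7):
    print(t, fp_fn(y, p, t))   # lower t: fewer FN, more FP
```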
In situations where the positive class is much more likely than the negative class, the accuracy also isn't very informative. Consider for instance spam classification where around 99% of the emails are spam. A classifier that always returns a positive result (spam in this case) then achieves 99% accuracy.
In these situations the precision and recall measures might be more interesting. Precision is defined as:

\[\text{precision} = \frac{TP}{TP + FP}\]

i.e. the fraction of positive classifications that were actually positive as well.
The recall is defined as:

\[\text{recall} = \frac{TP}{TP + FN}\]

i.e. what fraction of the actual positives were classified as such (again note that FN counts the actual positive cases that are classified as negative).
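A minimal sketch of both measures in terms of the confusion matrix counts (again using the illustrative counts from the earlier example):

```python
def precision(TP, FP):
    """Fraction of positive classifications that are actually positive."""
    return TP / (TP + FP)

def recall(TP, FN):
    """Fraction of actual positives that are classified as positive."""
    return TP / (TP + FN)

print(precision(TP=3, FP=1))  # 0.75
print(recall(TP=3, FN=1))     # 0.75
```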