6.1. Bayesian Classification

Let \(\v X\) be the random (feature) vector \(\v X=(X_1\cdots X_n)\T\) characterizing an object. The class of this object is characterized by the discrete random variable \(Y\). The goal in classification is to construct a classification function \(\hat y = \classify(\v x)\) that assigns a class to a feature vector \(\v x\).

To quantify the success of the classifier we introduce the loss function \(L\), where \(L(\hat y, y)\) quantifies the loss incurred when the classifier predicts \(\hat y\) while the true class is \(y\). The loss function makes it possible to distinguish between the different types of errors a classifier can make. For a medical test it is often considered less of a problem if a healthy patient is incorrectly diagnosed with a disease than the opposite case, where the patient is declared healthy while in reality she is not.
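
As a purely hypothetical illustration of such an asymmetric loss, consider a two-class medical test with \(y=0\) for healthy and \(y=1\) for diseased, and take

\[L(\hat y, y) = \begin{cases} 0 & \text{if } \hat y = y\\ 1 & \text{if } \hat y = 1,\ y = 0 \quad\text{(false alarm)}\\ 10 & \text{if } \hat y = 0,\ y = 1 \quad\text{(missed disease)} \end{cases}\]

making a missed disease ten times as costly as a false alarm.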

In many classification problems the zero-one loss is used: \(L(\hat y, y)=[\hat y\not=y]\). That is, the loss equals 1 in case the classifier is wrong and 0 in case it is right.

The squared error \(L(\hat y,y) = (\hat y-y)^2\) is also used as a loss function (for instance in neural networks), but it is more often associated with regression.

The expected loss of assigning class \(\hat y\) to an object with feature vector \(\v X = \v x\) equals:

\[\mathcal L(\hat y;\v x) = \E(L(\hat y,Y)\given \v X=\v x) = \sum_y L(\hat y, y)\, \P(Y=y\given \v X = \v x)\]
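
As a minimal computational sketch (not part of the derivation above), assume the posterior \(\P(Y=y\given \v X=\v x)\) is available as a dictionary mapping each class to its probability and the loss is given as a plain Python function; the names `expected_loss`, `posterior` and `loss` are illustrative only.

```python
def expected_loss(y_hat, posterior, loss):
    """Expected loss of assigning class y_hat, where posterior is a dict
    {y: P(Y=y | X=x)} and loss(y_hat, y) is the loss function."""
    return sum(loss(y_hat, y) * p for y, p in posterior.items())

# Hypothetical two-class example with the asymmetric loss from above:
posterior = {0: 0.7, 1: 0.3}                  # P(Y=0|x)=0.7, P(Y=1|x)=0.3
loss = lambda y_hat, y: 0 if y_hat == y else (10 if y == 1 else 1)
print(expected_loss(0, posterior, loss))      # 3.0 (risk of declaring healthy)
print(expected_loss(1, posterior, loss))      # 0.7 (risk of declaring diseased)
```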

The Bayesian classifier then finds the class \(\hat y\) with minimal expected loss:

\[\classify(\v x) = \arg\min_{\hat y} \mathcal L(\hat y; \v x) = \arg\min_{\hat y} \sum_y L(\hat y, y)\, \P(Y=y\given \v X = \v x)\]
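
The classifier is then a straightforward \(\arg\min\) over the classes. A minimal sketch, again assuming the posterior is given as a dictionary and the loss as a function (illustrative names only):

```python
def bayes_classify(posterior, loss):
    """Class y_hat minimizing the expected loss sum_y loss(y_hat, y) * P(Y=y|X=x)."""
    return min(posterior,
               key=lambda y_hat: sum(loss(y_hat, y) * p
                                     for y, p in posterior.items()))

# With the hypothetical asymmetric loss the costly miss is avoided:
posterior = {0: 0.7, 1: 0.3}
loss = lambda y_hat, y: 0 if y_hat == y else (10 if y == 1 else 1)
print(bayes_classify(posterior, loss))   # -> 1, even though class 0 is more probable
```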

We will look in somewhat more detail at the Bayesian classifier with zero-one loss. First we show that the zero-one loss function leads to the Maximum A Posteriori (MAP) classifier. Then we consider the Naive Bayes classifier.