6.1.2. Naive Bayes Classifier

6.1.2.1. Discrete Naive Bayes Classifier

We start with the simple example of a naive Bayesian spam classifier. An (email) text message is characterized by \(n\) boolean random variables. Each variable \(W_i\) has value 1 in case the \(i\)th word from a dictionary is present in the email. Here we take the dictionary for granted. In a real spam filtering system, setting up the dictionary and mapping the text onto the words in the dictionary is of course a very important aspect of such a system.

Let \(S\) be the random variable that indicates whether an email is spam: \(S=1\) for a spam email and \(S=0\) for a non-spam email.

A Bayesian classifier classifies an email characterized by \(W_1=w_1,\dots,W_n=w_n\) as spam in case:

\[\P(S=1\given W_1=w_1,\dots,W_n=w_n) \geq \P(S=0\given W_1=w_1,\dots,W_n=w_n)\]

The joint probability function for this problem is

\[p_{W_1\cdots W_n\,S}(w_1,\ldots,w_n,s) = \P(W_1=w_1,\dots,W_n=w_n, S=s)\]

This joint probability function encoded as a table giving the value for each possible combination of the variables is quite big: the table has \(2^{n+1}\) entries, which for a dictionary of \(1000\) words amounts to \(2^{1001}\) entries. Way too big to learn.

Using Bayes rule we first rewrite the classification rule:

\[\frac{\P(W_1=w_1,\dots,W_n=w_n \given S=1) \P(S=1)}{\P(W_1=w_1,\dots,W_n=w_n)} \geq \frac{\P(W_1=w_1,\dots,W_n=w_n \given S=0) \P(S=0)}{\P(W_1=w_1,\dots,W_n=w_n)}\]

The probability of the data, \(\P(W_1=w_1,\dots,W_n=w_n)\), is called the evidence; since it appears on both sides of the inequality it can be left out:

\[\P(W_1=w_1,\dots,W_n=w_n \given S=1) \P(S=1) \geq \P(W_1=w_1,\dots,W_n=w_n \given S=0) \P(S=0)\]

But this still does not lead to an improvement: we now have two tables, each with \(2^{n}=2^{1000}\) entries.

The Naive Bayes classifier assumes that the \(W_i\)'s are conditionally independent given \(S\), i.e.

\[\P(W_1=w_1,\dots,W_n=w_n \given S=s) = \prod_{i=1}^{n} \P(W_i=w_i \given S=s)\]

Using this naive assumption leads to the classifier:

\[\P(S=1)\,\prod_{i=1}^{n} \P(W_i=w_i \given S=1) \geq \P(S=0)\,\prod_{i=1}^{n} \P(W_i=w_i \given S=0)\]

Now only \(2n+2\) probabilities need to be known to use such a classifier: \(\P(W_i=1\given S=s)\) for \(i=1,\ldots,n\) and \(s\in\{0,1\}\), together with \(\P(S=1)\) and \(\P(S=0)\). These probabilities can easily be learned from data.
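As an illustration, the following is a minimal sketch (not part of the original text) of how these probabilities could be estimated from a labeled training set. It assumes the emails are given as a 0/1 matrix of word occurrences; the names estimate_parameters, emails, labels and the Laplace smoothing parameter alpha are illustrative choices, not prescribed by the text:

    # Sketch: estimate the 2n+2 Naive Bayes probabilities from labeled data.
    # Laplace smoothing (alpha) keeps estimated probabilities away from 0 and 1.
    import numpy as np

    def estimate_parameters(emails, labels, alpha=1.0):
        """emails: (m, n) 0/1 array, one row per email, one column per dictionary word.
        labels: (m,) 0/1 array with 1 for spam.
        Returns P(S=1), P(S=0), P(W_i=1|S=1) and P(W_i=1|S=0)."""
        emails = np.asarray(emails)
        labels = np.asarray(labels)
        p_spam = labels.mean()            # P(S=1)
        p_ham = 1.0 - p_spam              # P(S=0)
        n_spam = (labels == 1).sum()
        n_ham = (labels == 0).sum()
        # P(W_i=1 | S=s) for all i at once, with Laplace smoothing
        p_w_spam = (emails[labels == 1].sum(axis=0) + alpha) / (n_spam + 2 * alpha)
        p_w_ham = (emails[labels == 0].sum(axis=0) + alpha) / (n_ham + 2 * alpha)
        return p_spam, p_ham, p_w_spam, p_w_ham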

Often the classifier is written as:

\[\frac{\prod_{i=1}^{n} \P(W_i=w_i \given S=1)}{ \prod_{i=1}^{n} \P(W_i=w_i \given S=0)} \geq \frac{\P(S=0)}{\P(S=1)}\]

In the case of a spam filter it turns out that in practice \(\P(S=1)\gg\P(S=0)\). Instead of using the a priori probabilities for spam and non-spam, a threshold \(T\) is chosen:

\[\frac{\prod_{i=1}^{n} \P(W_i=w_i \given S=1)}{ \prod_{i=1}^{n} \P(W_i=w_i \given S=0)} \geq T\]

In practice the value of \(T\) is learned from the data. In that situation a value of \(T\) can be selected that not only considers the number of classification errors but also weighs the two types of error differently. In the case of spam filtering it is better to misclassify some spam emails as non-spam than to classify non-spam emails as spam (and never read them…).
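As a sketch of the resulting decision rule (assuming the parameter estimates from the previous sketch), the product of many small probabilities is best evaluated in log space to avoid numerical underflow; the comparison with \(T\) then becomes a comparison with \(\log T\):

    # Sketch: thresholded Naive Bayes decision rule, computed with logarithms.
    import numpy as np

    def classify_spam(email, p_w_spam, p_w_ham, log_T=0.0):
        """email: (n,) 0/1 vector of word occurrences; returns True for spam."""
        email = np.asarray(email)
        # log P(W_i=w_i | S=s) for the observed w_i (word present or absent)
        log_like_spam = np.where(email == 1, np.log(p_w_spam),
                                 np.log(1.0 - p_w_spam)).sum()
        log_like_ham = np.where(email == 1, np.log(p_w_ham),
                                np.log(1.0 - p_w_ham)).sum()
        # spam when the log likelihood ratio exceeds the (log) threshold
        return (log_like_spam - log_like_ham) >= log_T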

6.1.2.2. Continuous Naive Bayes Classifier

We now consider the case where we have \(n\) continuous random variables \(X_1,\ldots,X_n\) and, based upon their values, have to classify an object into one of \(k\) classes, \(C=1\) to \(C=k\).

The MAP (maximum a posteriori) classifier then is:

\[c^\star = \arg \max_c \P(C=c\given X_1=x_1, \ldots, X_n=x_n)\]

Using Bayes rule we can write:

\[\P(C=c\given X_1=x_1, \ldots, X_n=x_n) = \frac{f_{X_1\cdots X_n\given C=c}(x_1,\ldots,x_n) \, \P(C=c)}{f_{X_1\cdots X_n}(x_1,\ldots,x_n)}\]

Leading to:

\[\begin{split}c^\star &= \arg \max_c \frac{f_{X_1\cdots X_n\given C=c}(x_1,\ldots,x_n) \, \P(C=c)}{f_{X_1\cdots X_n}(x_1,\ldots,x_n)}\\ &= \arg \max_c f_{X_1\cdots X_n\given C=c}(x_1,\ldots,x_n) \, \P(C=c)\end{split}\]

The last equality holds because the evidence for the data does not depend on the class. Note carefully that the value that is being maximized in this last expression is not a probability anymore.

The naive assumption again is that the feature values \(X_1,\ldots,X_n\) are class conditionally independent, i.e.

\[f_{X_1\cdots X_n\given C=c}(x_1,\ldots,x_n) = \prod_{i=1}^{n} f_{X_i\given C=c}(x_i)\]

In practice the Naive Bayes classifier for continuous random variables is most often used when we may assume that each \(X_i\) is distributed according to some parameterized pdf. Then we only need to estimate the parameters of that distribution for each feature and each class.
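For instance, assuming each \(X_i\) is normally distributed within each class (the common Gaussian Naive Bayes choice, not prescribed by the text above), a minimal sketch only has to estimate a per-class mean and variance for every feature; all names below are illustrative:

    # Sketch: Gaussian Naive Bayes, assuming X_i | C=c is normally distributed.
    import numpy as np

    def fit_gaussian_nb(X, y, k):
        """X: (m, n) feature matrix, y: (m,) class labels in 0..k-1."""
        X, y = np.asarray(X), np.asarray(y)
        priors = np.array([(y == c).mean() for c in range(k)])                  # P(C=c)
        means = np.array([X[y == c].mean(axis=0) for c in range(k)])            # mu_{c,i}
        variances = np.array([X[y == c].var(axis=0) + 1e-9 for c in range(k)])  # sigma^2_{c,i}
        return priors, means, variances

    def predict_gaussian_nb(x, priors, means, variances):
        """Return arg max_c P(C=c) * prod_i f_{X_i|C=c}(x_i), computed in log space."""
        log_pdf = -0.5 * (np.log(2 * np.pi * variances) + (x - means) ** 2 / variances)
        return int(np.argmax(np.log(priors) + log_pdf.sum(axis=1)))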