6.3.6. Multi Class Logistic Regression

Thus far we have used logistic regression only for two class problems. In that case a hypothesis that returns a single value, modelling the a posteriori probability of class \(y=1\) given a feature vector \(\v x\), suffices (simply because the probability for class \(y=0\) is the complementary probability).

For a multi class problem things are different. Let the target value \(y\) be a value from \(1\) to \(C\) (for a \(C\)-class problem). We could then be tempted to treat the multi class classification problem as a regression problem with target \(y\). This is most often not a good idea.

Most regression algorithms take the output \(\hat y = h_{\v\theta}(\v x)\) to be an orderable value, most often one defined on an interval scale. This implies that in the loss function the error between \(y=1\) and \(\hat y=2\) is smaller than the error between \(y=1\) and \(\hat y=4\). Class labels, however, are nominal: in most situations they are not orderable, let alone that differences between them are useful quantities.

We will distinguish three ways to ‘generalize’ the two class approach to \(C\) classes. The first two circumvent the problem described above by implementing multi class classification with several two class classifiers. The third way is a real generalization: it learns a vectorial hypothesis function (with \(C\) elements) whose values represent a probability distribution over the classes.

6.3.6.1. One vs All Multi Class

For a \(C\) class problem we train \(C\) two class classifiers: for each class \(c\) we train a classifier to distinguish class \(c\) from all other classes. To make a prediction we evaluate all \(C\) hypotheses and choose the class with the largest value, as in the sketch below.
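A minimal NumPy sketch of this scheme, assuming a helper ``train_binary_logreg(X, y)`` (hypothetical here, e.g. the gradient descent trainer for two class logistic regression from the previous sections) that returns a parameter vector \(\v\theta\), and class labels \(0,\dots,C-1\):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def one_vs_all_train(X, y, C, train_binary_logreg):
    # train_binary_logreg(X, y01) is assumed to return the parameter
    # vector theta of a two class logistic regression classifier
    thetas = []
    for c in range(C):
        y_c = (y == c).astype(float)       # class c versus all other classes
        thetas.append(train_binary_logreg(X, y_c))
    return np.stack(thetas)                # shape (C, n_features)

def one_vs_all_predict(X, thetas):
    # evaluate all C hypotheses and choose the class with the largest value
    scores = sigmoid(X @ thetas.T)         # shape (n_samples, C)
    return np.argmax(scores, axis=1)
```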

6.3.6.2. One vs One Multi Class

Now we train a lot more classifiers: one for every pair of classes \(c\) and \(c'\) (with \(c\not=c'\)), i.e. \(C(C-1)/2\) classifiers in total. To make a prediction we let every classifier predict a class and from all these predictions we select the class with the most ‘votes’ (see the sketch below).
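A sketch of this voting scheme, again assuming the hypothetical ``train_binary_logreg`` helper from above; each pairwise classifier is trained only on the samples of its two classes:

```python
import numpy as np
from itertools import combinations

def one_vs_one_train(X, y, C, train_binary_logreg):
    # one binary classifier per unordered pair (c, c2)
    models = {}
    for c, c2 in combinations(range(C), 2):
        mask = (y == c) | (y == c2)
        y_pair = (y[mask] == c2).astype(float)   # c -> 0, c2 -> 1
        models[(c, c2)] = train_binary_logreg(X[mask], y_pair)
    return models

def one_vs_one_predict(X, models, C):
    votes = np.zeros((X.shape[0], C), dtype=int)
    for (c, c2), theta in models.items():
        pred_c2 = (X @ theta) > 0               # sigmoid(z) > 0.5 iff z > 0
        votes[np.arange(X.shape[0]), np.where(pred_c2, c2, c)] += 1
    return np.argmax(votes, axis=1)             # class with the most votes
```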

6.3.6.3. Softmax Multi Class

Softmax is nowadays the standard way to implement a multi class classification system, and hence also multi class logistic regression. Consider \(C\) linear units, each with input \(\v x\), i.e.

\[z_i = \v\theta_i\T \v x\]

The individual sigmoid units for each of the linear combinations \(z_i\) are replaced with a softmax layer:

\[\hat y_i = \frac{e^{z_i}}{\sum_{j=1}^{C} e^{z_j}}\]

Note that by definition the \(\hat y_i\) add up to one, so they can be interpreted as the a posteriori probability distribution over the class labels.
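As an illustration, a small NumPy sketch of the softmax function (shifting by the maximum is a common numerical safeguard and does not change the result):

```python
import numpy as np

def softmax(z):
    # z: the C linear combinations theta_i^T x (last axis runs over classes)
    z = z - np.max(z, axis=-1, keepdims=True)   # numerical stability
    e = np.exp(z)
    return e / np.sum(e, axis=-1, keepdims=True)

z = np.array([2.0, 1.0, 0.1])
print(softmax(z))          # approximately [0.659, 0.242, 0.099]
print(softmax(z).sum())    # 1.0: a probability distribution over the classes
```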

Also observe that for this scheme we need a one-hot encoding of the target value. For instance if \(y=1\) the target vector should be \((1\,0\cdots\,0)\T\). For \(y=3\) the target vector should be \((0\,0\,1\,0\cdots\,0)\T\).
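A possible NumPy one-hot encoder (assuming zero based labels \(0,\dots,C-1\); subtract one first if, as above, the labels run from \(1\) to \(C\)):

```python
import numpy as np

def one_hot(y, C):
    # y: integer class labels in 0..C-1
    Y = np.zeros((len(y), C))
    Y[np.arange(len(y)), y] = 1.0
    return Y

print(one_hot(np.array([0, 2, 1]), C=4))
# [[1. 0. 0. 0.]
#  [0. 0. 1. 0.]
#  [0. 1. 0. 0.]]
```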

The loss function \(\ell\) for a single training sample \(\v x\) used in this case is the cross entropy:

\[\ell(\v x) = - \sum_{i=1}^{C} \left(y_i \log \hat y_i + (1-y_i)\log(1-\hat y_i)\right)\]

where \(y_i\) are the elements of the one-hot encoding of the target class and \(\hat y_i\) is the \(i\)-th softmax output computed from the linear combinations \(z_i = \v\theta_i\T \v x\).
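A small NumPy sketch of this loss for a single sample, with the target as a one-hot vector and \(\hat y_i\) as the softmax outputs (the clipping constant ``eps`` is only there to avoid taking the logarithm of zero):

```python
import numpy as np

def cross_entropy_loss(y_onehot, y_hat, eps=1e-12):
    # y_onehot: one-hot target vector, y_hat: softmax outputs (both length C)
    y_hat = np.clip(y_hat, eps, 1.0 - eps)
    return -np.sum(y_onehot * np.log(y_hat)
                   + (1.0 - y_onehot) * np.log(1.0 - y_hat))

y = np.array([0.0, 1.0, 0.0])        # one-hot target for class 2
y_hat = np.array([0.2, 0.7, 0.1])    # softmax outputs hat y_i
print(cross_entropy_loss(y, y_hat))  # approximately 0.685
```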