Multi Class Logistic Regression
===============================

Thus far we have used logistic regression only for two class problems. There a hypothesis that returns a single value, modelling the a posteriori probability of assigning class $y=1$ to a feature vector $\v x$, suffices (the probability for class $y=0$ is simply the complementary probability).

For a multi class problem things are different. Let the target value $y$ be a value from $1$ to $C$ (for a $C$-class problem). We could then treat multi class classification as a regression problem. This is most often *not* a good idea. Most regression algorithms take the output $\hat y = h_{\v\theta}(\v x)$ to be an orderable value, most often a value defined on an interval scale. This implies that for the loss function the error between $y=1$ and $\hat y=2$ is smaller than the error between $y=1$ and $\hat y=4$. In most classification problems the class labels are not even orderable, let alone that differences between labels are useful quantities.

We will distinguish three ways to 'generalize' the two class approach to $C$ classes. The first two circumvent the problem described above: both implement multi class classification by combining several two class classifiers. The third way is a true generalization, as it learns a vectorial hypothesis function (with $C$ elements) whose values represent a probability distribution over the classes.

One vs All Multi Class
----------------------

For a $C$ class problem we train $C$ two class classifiers: for each class $c$ we train a classifier to distinguish class $c$ from all other classes. To make a prediction we evaluate all $C$ hypotheses and choose the class with the largest value.

One vs One Multi Class
----------------------

Now we train a lot more classifiers: one for every pair of classes $c$ and $c'$ (with $c\not=c'$). To classify a feature vector we run all these pairwise classifiers, each predicting a class, and we select the class that receives the most 'votes'.

Softmax Multi Class
-------------------

Softmax is nowadays the standard way to implement a multi class classification system, and it is the standard choice for multi class logistic regression as well. Consider $C$ linear units, each with input $\v x$, i.e.

.. math:: z_i = \v\theta_i\T \v x

The individual sigmoid units for each of the linear combinations $z_i$ are replaced with a **softmax layer**:

.. math:: \hat y_i = \frac{e^{z_i}}{\sum_{j=1}^{C} e^{z_j}}

Note that by definition the $\hat y_i$ add up to one, so they can be interpreted as the a posteriori probability distribution over the class labels. For $C=2$ the softmax reduces to the familiar sigmoid, since $\hat y_1 = e^{z_1}/(e^{z_1}+e^{z_2}) = 1/(1+e^{-(z_1-z_2)})$, so this scheme truly generalizes two class logistic regression.

Also observe that for this scheme we need a one-hot encoding of the target value. For instance if $y=1$ the target vector should be $(1\,0\,\cdots\,0)\T$, and for $y=3$ the target vector should be $(0\,0\,1\,0\,\cdots\,0)\T$.

The loss function $\ell$ for one training sample $\v x$ to be used in this case is the :doc:`cross entropy `:

.. math:: \ell(\v x) = - \sum_{i=1}^{C} y_i \log \hat y_i

where $y_i$ are the elements of the one-hot encoding of the target class and $\hat y_i$ is the $i$-th element of the softmax output. Because $\v y$ is one-hot, only one term survives and the loss reduces to $-\log \hat y_c$ for the true class $c$.
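To make the three schemes concrete we end with some minimal NumPy sketches. These are illustrations, not library code: all function names are ours, and the labels are taken to run from $0$ to $C-1$ as is customary in code. First the one vs all scheme: we train $C$ binary logistic classifiers with the standard logistic regression gradient $(\sigma(\v\theta\T\v x) - y)\,\v x$ and predict the class whose classifier returns the largest value.

.. code-block:: python

   import numpy as np

   def sigmoid(z):
       return 1 / (1 + np.exp(-z))

   def fit_one_vs_all(X, y, C, lr=0.1, epochs=1000):
       """Train C binary logistic classifiers, each class against the rest.

       X : (m, n) array, one feature vector per row (include a 1 for the bias)
       y : (m,) array of integer labels in 0..C-1
       """
       m, n = X.shape
       Theta = np.zeros((n, C))           # column c holds the parameters for class c
       for c in range(C):
           t = (y == c).astype(float)     # binary targets: class c vs the rest
           for _ in range(epochs):
               p = sigmoid(X @ Theta[:, c])
               Theta[:, c] -= lr * X.T @ (p - t) / m   # gradient descent step
       return Theta

   def predict_one_vs_all(Theta, X):
       # the sigmoid is monotone, so comparing the linear outputs z suffices
       return np.argmax(X @ Theta, axis=1)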
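The one vs one scheme trains a classifier for every unordered pair of classes, each trained only on the samples of those two classes, and lets every classifier cast a vote. A sketch under the same assumptions, reusing ``sigmoid`` and NumPy from the previous fragment:

.. code-block:: python

   from itertools import combinations

   def fit_one_vs_one(X, y, C, lr=0.1, epochs=1000):
       """Train one binary classifier per pair of classes (c, cp)."""
       classifiers = {}
       for c, cp in combinations(range(C), 2):
           mask = (y == c) | (y == cp)          # keep only samples of the two classes
           Xp = X[mask]
           t = (y[mask] == cp).astype(float)    # target 1 means 'vote for cp'
           theta = np.zeros(X.shape[1])
           for _ in range(epochs):
               p = sigmoid(Xp @ theta)
               theta -= lr * Xp.T @ (p - t) / len(t)
           classifiers[(c, cp)] = theta
       return classifiers

   def predict_one_vs_one(classifiers, X, C):
       votes = np.zeros((X.shape[0], C))
       for (c, cp), theta in classifiers.items():
           winner = np.where(sigmoid(X @ theta) > 0.5, cp, c)   # each classifier votes
           votes[np.arange(X.shape[0]), winner] += 1
       return np.argmax(votes, axis=1)          # the class with the most votes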
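Finally the softmax scheme itself. With the samples stacked as rows of $X$ and the one-hot targets as rows of $Y$, the gradient of the mean cross entropy with respect to the parameter matrix is $X\T(\hat Y - Y)/m$, which gives a particularly compact batch gradient descent loop:

.. code-block:: python

   def softmax(Z):
       E = np.exp(Z - Z.max(axis=1, keepdims=True))   # subtract the max for stability
       return E / E.sum(axis=1, keepdims=True)

   def fit_softmax_logreg(X, y, C, lr=0.1, epochs=1000):
       """Batch gradient descent on the mean cross entropy loss."""
       m, n = X.shape
       Y = np.eye(C)[y]                  # one-hot encoded targets, shape (m, C)
       Theta = np.zeros((n, C))          # one parameter vector theta_i per class
       for _ in range(epochs):
           Yhat = softmax(X @ Theta)     # predicted probabilities, shape (m, C)
           Theta -= lr * X.T @ (Yhat - Y) / m
       return Theta

   def predict_softmax(Theta, X):
       return np.argmax(X @ Theta, axis=1)

As a toy check (on made-up data, not from the text) we can fit three Gaussian blobs in the plane:

.. code-block:: python

   rng = np.random.default_rng(0)
   X = np.vstack([rng.normal(mu, 0.5, (50, 2)) for mu in ([0, 0], [3, 0], [0, 3])])
   X = np.hstack([np.ones((150, 1)), X])        # prepend the bias feature
   y = np.repeat([0, 1, 2], 50)
   Theta = fit_softmax_logreg(X, y, C=3)
   print((predict_softmax(Theta, X) == y).mean())   # training accuracy, close to 1.0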