Multi Class Logistic Regression
===============================

Thus far we have used logistic regression only for two class problems. There a hypothesis that returns a single value, modelling the a posteriori probability of assigning class $y=1$ to a feature vector $\v x$, suffices (the probability for class $y=0$ is simply the complementary probability).

For a multi class problem things are different. Let the target value $y$ be a value from $1$ to $C$ (for a $C$-class problem). We could then treat multi class classification as a regression problem. This is most often *not* a good idea. Most regression algorithms take the output $\hat y = h_{\v\theta}(\v x)$ to be an orderable value, most often a value defined on an interval scale. This implies that for the loss function the error between $y=1$ and $\hat y=2$ is smaller than the error between $y=1$ and $\hat y=4$. In most classification problems the class labels are not even orderable, let alone that differences between labels are useful quantities.

We will distinguish three ways to 'generalize' the two class approach to $C$ classes. The first two circumvent the problem described above: both implement multi class classification by combining several two class classifiers. The third way is a true generalization, as it learns a vectorial hypothesis function (with $C$ elements) whose values represent a probability distribution over the classes.

One vs All Multi Class
----------------------

For a $C$ class problem we train $C$ two class classifiers: for each class $c$ we train a classifier to distinguish class $c$ from all other classes. To make a prediction we evaluate all $C$ hypotheses and choose the class with the largest value.

One vs One Multi Class
----------------------

Now we train a lot more classifiers: one for every pair of classes $c$ and $c'$ (with $c\not=c'$). To classify a feature vector we run all these pairwise classifiers, each predicting a class, and we select the class that receives the most 'votes'.

Softmax Multi Class
-------------------

Softmax is nowadays the standard way to implement a multi class classification system, and it is the standard choice for multi class logistic regression as well. Consider $C$ linear units, each with input $\v x$, i.e.

.. math:: z_i = \v\theta_i\T \v x

The individual sigmoid units for each of the linear combinations $z_i$ are replaced with a **softmax layer**:

.. math:: \hat y_i = \frac{e^{z_i}}{\sum_{j=1}^{C} e^{z_j}}

Note that by definition the $\hat y_i$ add up to one, so they can be interpreted as the a posteriori probability distribution over the class labels. For $C=2$ the softmax reduces to the familiar sigmoid, since $\hat y_1 = e^{z_1}/(e^{z_1}+e^{z_2}) = 1/(1+e^{-(z_1-z_2)})$, so this scheme truly generalizes two class logistic regression.

Also observe that for this scheme we need a one-hot encoding of the target value. For instance if $y=1$ the target vector should be $(1\,0\,\cdots\,0)\T$, and for $y=3$ the target vector should be $(0\,0\,1\,0\,\cdots\,0)\T$.

The loss function $\ell$ for one training sample $\v x$ to be used in this case is the :doc:`cross entropy `:

.. math:: \ell(\v x) = - \sum_{i=1}^{C} y_i \log \hat y_i

where $y_i$ are the elements of the one-hot encoding of the target class and $\hat y_i$ is the $i$-th element of the softmax output. Because $\v y$ is one-hot, only one term survives and the loss reduces to $-\log \hat y_c$ for the true class $c$.
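To make the three schemes concrete we end with some minimal NumPy sketches. These are illustrations, not library code: all function names are ours, and the labels are taken to run from $0$ to $C-1$ as is customary in code. First the one vs all scheme: we train $C$ binary logistic classifiers with the standard logistic regression gradient $(\sigma(\v\theta\T\v x) - y)\,\v x$ and predict the class whose classifier returns the largest value.

.. code-block:: python

   import numpy as np

   def sigmoid(z):
       return 1 / (1 + np.exp(-z))

   def fit_one_vs_all(X, y, C, lr=0.1, epochs=1000):
       """Train C binary logistic classifiers, each class against the rest.

       X : (m, n) array, one feature vector per row (include a 1 for the bias)
       y : (m,) array of integer labels in 0..C-1
       """
       m, n = X.shape
       Theta = np.zeros((n, C))           # column c holds the parameters for class c
       for c in range(C):
           t = (y == c).astype(float)     # binary targets: class c vs the rest
           for _ in range(epochs):
               p = sigmoid(X @ Theta[:, c])
               Theta[:, c] -= lr * X.T @ (p - t) / m   # gradient descent step
       return Theta

   def predict_one_vs_all(Theta, X):
       # the sigmoid is monotone, so comparing the linear outputs z suffices
       return np.argmax(X @ Theta, axis=1)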
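The one vs one scheme trains a classifier for every unordered pair of classes, each trained only on the samples of those two classes, and lets every classifier cast a vote. A sketch under the same assumptions, reusing ``sigmoid`` and NumPy from the previous fragment:

.. code-block:: python

   from itertools import combinations

   def fit_one_vs_one(X, y, C, lr=0.1, epochs=1000):
       """Train one binary classifier per pair of classes (c, cp)."""
       classifiers = {}
       for c, cp in combinations(range(C), 2):
           mask = (y == c) | (y == cp)          # keep only samples of the two classes
           Xp = X[mask]
           t = (y[mask] == cp).astype(float)    # target 1 means 'vote for cp'
           theta = np.zeros(X.shape[1])
           for _ in range(epochs):
               p = sigmoid(Xp @ theta)
               theta -= lr * Xp.T @ (p - t) / len(t)
           classifiers[(c, cp)] = theta
       return classifiers

   def predict_one_vs_one(classifiers, X, C):
       votes = np.zeros((X.shape[0], C))
       for (c, cp), theta in classifiers.items():
           winner = np.where(sigmoid(X @ theta) > 0.5, cp, c)   # each classifier votes
           votes[np.arange(X.shape[0]), winner] += 1
       return np.argmax(votes, axis=1)          # the class with the most votes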
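Finally the softmax scheme itself. With the samples stacked as rows of $X$ and the one-hot targets as rows of $Y$, the gradient of the mean cross entropy with respect to the parameter matrix is $X\T(\hat Y - Y)/m$, which gives a particularly compact batch gradient descent loop:

.. code-block:: python

   def softmax(Z):
       E = np.exp(Z - Z.max(axis=1, keepdims=True))   # subtract the max for stability
       return E / E.sum(axis=1, keepdims=True)

   def fit_softmax_logreg(X, y, C, lr=0.1, epochs=1000):
       """Batch gradient descent on the mean cross entropy loss."""
       m, n = X.shape
       Y = np.eye(C)[y]                  # one-hot encoded targets, shape (m, C)
       Theta = np.zeros((n, C))          # one parameter vector theta_i per class
       for _ in range(epochs):
           Yhat = softmax(X @ Theta)     # predicted probabilities, shape (m, C)
           Theta -= lr * X.T @ (Yhat - Y) / m
       return Theta

   def predict_softmax(Theta, X):
       return np.argmax(X @ Theta, axis=1)

As a toy check (on made-up data, not from the text) we can fit three Gaussian blobs in the plane:

.. code-block:: python

   rng = np.random.default_rng(0)
   X = np.vstack([rng.normal(mu, 0.5, (50, 2)) for mu in ([0, 0], [3, 0], [0, 3])])
   X = np.hstack([np.ones((150, 1)), X])        # prepend the bias feature
   y = np.repeat([0, 1, 2], 50)
   Theta = fit_softmax_logreg(X, y, C=3)
   print((predict_softmax(Theta, X) == y).mean())   # training accuracy, close to 1.0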