6.3.3. Gradient Descent

Logistic regression finds the optimal \(\v\theta\) given by:

\[\hat{\v\theta} = \arg\max_{\v\theta} \log( \ell(\v\theta))\]

where

\[\ell(\v\theta) = \prod_{i=1}^{m} \left(h_{\v\theta}(\v x\ls i)\right)^{y\ls i} \, \left(1-h_{\v\theta}(\v x\ls i)\right)^{1-y\ls i}\]

The maximum likelihood problem can be cast as the minimization of a cost function by taking the negative log-likelihood and averaging over the \(m\) examples:

\[\begin{split}J(\v\theta) &= - \frac{1}{m} \log( \ell(\v\theta))\\ &= \frac{1}{m}\sum_{i=1}^{m} \left( - y\ls i\,\log h_{\v\theta}(\v x\ls i) - (1 - y\ls i)\log(1-h_{\v\theta}(\v x\ls i)) \right)\end{split}\]
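
As a concrete illustration, this cost can be evaluated directly from the expression above. The sketch below assumes NumPy, a design matrix `Xt` whose rows are the augmented feature vectors \(\tilde{\v x}\ls i\) (with a leading column of ones), a vector `y` of 0/1 labels, and a helper `sigmoid` for the logistic function \(g\); all names are illustrative, not prescribed by these notes.

```python
import numpy as np

def sigmoid(v):
    """Logistic function g(v) = 1 / (1 + exp(-v))."""
    return 1.0 / (1.0 + np.exp(-v))

def cost(theta, Xt, y):
    """Average negative log-likelihood J(theta).

    Xt : (m, n+1) matrix with the augmented feature vectors as rows.
    y  : (m,) vector of 0/1 labels.
    """
    h = sigmoid(Xt @ theta)  # h_theta(x^(i)) for all examples at once
    eps = 1e-12              # guards against log(0); an implementation detail only
    return -np.mean(y * np.log(h + eps) + (1 - y) * np.log(1 - h + eps))
```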

The gradient is:

\[\frac{\partial J(\v\theta)}{\partial \v\theta} = \frac{1}{m} \sum_{i=1}^{m} \left( - y\ls i\,\frac{\partial}{\partial \v\theta}\log h_{\v\theta}(\v x\ls i) - (1 - y\ls i)\frac{\partial}{\partial \v\theta}\log(1-h_{\v\theta}(\v x\ls i)) \right)\]

For the first partial derivative in the above expression we have:

\[\begin{split}\frac{\partial}{\partial \v\theta}\log h_{\v\theta}(\v x\ls i) &= \frac{\partial}{\partial \v\theta} \log g(\v\theta\T \tilde{\v x}\ls i)\\ &= \frac{1}{g(\v\theta\T \tilde{\v x}\ls i)}\,g'(\v\theta\T \tilde{\v x}\ls i)\, \tilde{\v x}\ls i\end{split}\]

For the logistic function we have \(g'(v) = g(v)\,(1-g(v))\) and thus:

\[\begin{split}\frac{\partial}{\partial \v\theta}\log h_{\v\theta}(\v x\ls i) &= (1 - g(\v\theta\T \tilde{\v x}\ls i))\,\tilde{\v x}\ls i\end{split}\]
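
The identity \(g'(v) = g(v)\,(1-g(v))\) is easy to verify numerically. The snippet below is a small sanity check, comparing a central finite-difference estimate of \(g'\) with the closed form (names and tolerances are illustrative).

```python
import numpy as np

sigmoid = lambda v: 1.0 / (1.0 + np.exp(-v))   # logistic function g

v = np.linspace(-5.0, 5.0, 11)
dv = 1e-6
g_prime_fd = (sigmoid(v + dv) - sigmoid(v - dv)) / (2.0 * dv)  # finite difference
g_prime_cf = sigmoid(v) * (1.0 - sigmoid(v))                   # g(v) (1 - g(v))
print(np.allclose(g_prime_fd, g_prime_cf, atol=1e-8))          # True
```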

For the second partial derivative we get:

\[\begin{split}\frac{\partial}{\partial \v\theta}\log(1-h_{\v\theta}(\v x\ls i)) &= \frac{\partial}{\partial \v\theta}\log(1-g(\v\theta\T \tilde{\v x}\ls i))\\ &= \frac{-1}{1-g(\v\theta\T \tilde{\v x}\ls i)}\, g'(\v\theta\T \tilde{\v x}\ls i)\, \tilde{\v x}\ls i \\ &= - g(\v\theta\T \tilde{\v x}\ls i)\, \tilde{\v x}\ls i\end{split}\]

This leads to the following expression for the gradient:

\[\begin{split}\frac{\partial J(\v\theta)}{\partial \v\theta} &= \frac{1}{m} \sum_{i=1}^{m} \left( -y\ls i (1 - g(\v\theta\T \tilde{\v x}\ls i))\,\tilde{\v x}\ls i + (1-y\ls i)\,g(\v\theta\T \tilde{\v x}\ls i)\, \tilde{\v x}\ls i \right)\\ &= \frac{1}{m} \sum_{i=1}^{m} \left( g(\v\theta\T \tilde{\v x}\ls i) - y\ls i \right) \tilde{\v x}\ls i\\ &= \frac{1}{m} \sum_{i=1}^{m} \left( h_{\v\theta}(\v x\ls i) - y\ls i \right) \tilde{\v x}\ls i\end{split}\]
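
In vectorized form this gradient is a single matrix product. The sketch below assumes, as before, a NumPy design matrix `Xt` of augmented examples and a 0/1 label vector `y`; the names are illustrative.

```python
import numpy as np

def gradient(theta, Xt, y):
    """Gradient of J(theta): (1/m) * Xt^T (g(Xt theta) - y)."""
    h = 1.0 / (1.0 + np.exp(-(Xt @ theta)))   # h_theta(x^(i)) for all examples
    return Xt.T @ (h - y) / Xt.shape[0]
```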

This final expression has the same form as the gradient for linear regression. Note carefully, however, that the hypothesis function \(h_{\v\theta}\) is different in the two cases.
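
Because the gradient takes this familiar form, batch gradient descent applies the usual update \(\v\theta \leftarrow \v\theta - \alpha\,\frac{\partial J(\v\theta)}{\partial \v\theta}\), with \(h_{\v\theta}\) being the only change relative to linear regression. Below is a minimal sketch, assuming NumPy and illustrative choices for the learning rate `alpha` and the number of iterations.

```python
import numpy as np

def fit_logistic(Xt, y, alpha=0.1, n_iters=5000):
    """Batch gradient descent for logistic regression.

    Xt : (m, n+1) augmented design matrix, y : (m,) 0/1 labels.
    alpha and n_iters are illustrative values, not prescribed by the derivation.
    """
    theta = np.zeros(Xt.shape[1])
    for _ in range(n_iters):
        h = 1.0 / (1.0 + np.exp(-(Xt @ theta)))        # h_theta for all examples
        theta -= alpha * Xt.T @ (h - y) / Xt.shape[0]  # theta := theta - alpha * dJ/dtheta
    return theta
```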