6.3.2. Maximum Likelihood Estimator

In a previous section we defined:

\[P(Y=1\given \v X=\v x) = h_{\v\theta}(\v x)\]

and therefore, for a two-class problem:

\[P(Y=0\given \v X=\v x) = 1 - h_{\v\theta}(\v x)\]

We thus have \(Y\given\v X=\v x \sim \Bernoulli(h_{\v\theta}(\v x))\), with probability mass function:

\[\begin{split}\P(Y=y\given\v X=\v x) = \begin{cases} h_{\v\theta}(\v x) &: y=1\\ 1 - h_{\v\theta}(\v x) &: y=0 \end{cases}\end{split}\]

or equivalently:

\[\P(Y=y\given\v X=\v x) = \left(h_{\v\theta}(\v x)\right)^y \, \left(1-h_{\v\theta}(\v x)\right)^{1-y}\]

Note that in the above expression we have used the fact that \(y\) is either zero or one.
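For concreteness, here is a tiny Python sketch checking that the compact expression reduces to the two cases above; the value used for \(h_{\v\theta}(\v x)\) is just an example number:

```python
def bernoulli_prob(y, h):
    """P(Y=y | X=x) written compactly as h**y * (1-h)**(1-y)."""
    return h**y * (1.0 - h)**(1 - y)

h = 0.8                      # assumed example value of h_theta(x)
print(bernoulli_prob(1, h))  # 0.8  -> equals h_theta(x)
print(bernoulli_prob(0, h))  # ~0.2 -> equals 1 - h_theta(x)
```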

The training set \((\v x\ls i, y\ls i)\) for \(i=1,\ldots,m\) can be considered a realization of \(m\) i.i.d. pairs of random vectors and variables \((\v X\ls i, Y\ls i)\). Because of this independence, the conditional probability of the entire training set factorizes as

\[\begin{split}\P(Y\ls 1=y\ls 1,\ldots,Y\ls m=y\ls m \given \v X\ls 1 = \v x\ls 1,\ldots, \v X\ls m = \v x\ls m ) = \\ \prod_{i=1}^{m} \P(Y\ls i=y\ls i\given \v X\ls i = \v x\ls i)\\ = \prod_{i=1}^{m} \left(h_{\v\theta}(\v x\ls i)\right)^{y\ls i} \, \left(1-h_{\v\theta}(\v x\ls i)\right)^{1-y\ls i}\end{split}\]

The above expression, viewed as a function of the parameter vector \(\v\theta\) for the fixed training set, is the likelihood of the data:

\[\ell(\v\theta) = \prod_{i=1}^{m} \left(h_{\v\theta}(\v x\ls i)\right)^{y\ls i} \, \left(1-h_{\v\theta}(\v x\ls i)\right)^{1-y\ls i}\]

The maximum likelihood estimator for \(\v\theta\) is obtained by maximizing the log-likelihood; since the logarithm is monotonically increasing this yields the same maximizer as \(\ell(\v\theta)\) itself, while turning the product into a sum:

\[\hat{\v\theta} = \arg\max_{\v\theta} \log( \ell(\v\theta))\]
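As an illustration, here is a minimal NumPy sketch of \(\log\ell(\v\theta)\), assuming \(h_{\v\theta}\) is the logistic (sigmoid) function applied to a linear combination of the features; the function names and the clipping constant are choices made for this example only:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_likelihood(theta, X, y):
    """log ell(theta) = sum_i [ y_i log h(x_i) + (1 - y_i) log(1 - h(x_i)) ]."""
    h = sigmoid(X @ theta)       # h_theta(x^(i)) for every training example at once
    eps = 1e-12                  # guard against log(0) for saturated predictions
    return np.sum(y * np.log(h + eps) + (1 - y) * np.log(1 - h + eps))
```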

Finding the optimal \(\v\theta\) has to be done with a numerical technique: unlike linear regression, logistic regression has no analytical (closed-form) solution. We therefore calculate the gradient of \(-\log\ell(\v\theta)\) and iterate gradient descent steps, just as we did for linear regression.
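A minimal sketch of such a gradient descent loop in NumPy is given below. It assumes \(h_{\v\theta}\) is the logistic sigmoid and uses the gradient of the negative log-likelihood that is derived in the next subsection; the learning rate, iteration count and function names are illustrative choices only:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, alpha=0.1, n_iters=5000):
    """Gradient descent on -log ell(theta) for logistic regression."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(n_iters):
        h = sigmoid(X @ theta)    # predictions h_theta(x^(i)) for all examples
        grad = X.T @ (h - y) / m  # gradient of the (averaged) negative log-likelihood
        theta -= alpha * grad     # descent step
    return theta
```

In such a sketch `X` would be a design matrix with one row per training example (including a column of ones for the intercept term) and `y` a vector of zeros and ones.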

You may skip the next subsection and take the gradient derivation for granted.