6.3.2. Maximum Likelihood Estimator
We have defined in a previous section that:
\[
h_{\v\theta}(\v x) = P(Y=1 \given \v X = \v x),
\]
and then of course, since the two class probabilities sum to one, for a 2-class problem:
\[
P(Y=0 \given \v X = \v x) = 1 - h_{\v\theta}(\v x).
\]
We thus have that \(Y\given\v X=\v x \sim \Bernoulli(h_{\v\theta}(\v x))\). The probability function then is:
\[
p_{Y\given\v X}(y \given \v x) =
\begin{cases}
h_{\v\theta}(\v x), & y = 1,\\
1 - h_{\v\theta}(\v x), & y = 0,
\end{cases}
\]
or equivalently:
\[
p_{Y\given\v X}(y \given \v x) = h_{\v\theta}(\v x)^{y}\,\bigl(1 - h_{\v\theta}(\v x)\bigr)^{1-y}.
\]
Note that in the above expression we have used the fact that \(y\) is either zero or one.
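As a quick numerical check (a minimal sketch; the value of \(h\) below is made up), the combined form \(h^y(1-h)^{1-y}\) indeed reduces to the two separate cases when \(y\) is zero or one:

```python
def bernoulli_pmf(y, h):
    """P(Y=y) for Y ~ Bernoulli(h), using the combined h^y (1-h)^(1-y) form."""
    return h**y * (1 - h)**(1 - y)

h = 0.8  # an arbitrary example value of h_theta(x)

# For y = 1 the second factor vanishes (exponent 0), leaving h;
# for y = 0 the first factor vanishes, leaving 1 - h.
assert bernoulli_pmf(1, h) == h
assert bernoulli_pmf(0, h) == 1 - h
```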
The training set \((\v x\ls i, y\ls i)\) for \(i=1,\ldots,m\) can be considered to be the realization of \(m\) i.i.d. random vectors and variables \((\v X\ls i, Y\ls i)\). The probability of the entire training set is then:
\[
\prod_{i=1}^{m} p_{Y\given\v X}(y\ls i \given \v x\ls i)
= \prod_{i=1}^{m} h_{\v\theta}(\v x\ls i)^{y\ls i}\,\bigl(1 - h_{\v\theta}(\v x\ls i)\bigr)^{1-y\ls i}.
\]
Viewed as a function of the parameter vector \(\v\theta\), the above expression can also be interpreted as the likelihood of \(\v\theta\) given the data (the training set):
\[
\ell(\v\theta) = \prod_{i=1}^{m} h_{\v\theta}(\v x\ls i)^{y\ls i}\,\bigl(1 - h_{\v\theta}(\v x\ls i)\bigr)^{1-y\ls i}.
\]
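Numerically, the likelihood is just the product of the per-sample probabilities; taking the logarithm turns the product into a sum, which is why we will work with \(\log\ell(\v\theta)\) below. A minimal sketch (the values of \(h_{\v\theta}(\v x\ls i)\) and \(y\ls i\) here are made-up numbers, purely for illustration):

```python
import numpy as np

# Hypothetical per-sample probabilities h_theta(x^(i)) and labels y^(i).
h = np.array([0.9, 0.2, 0.7, 0.6])
y = np.array([1,   0,   1,   0  ])

# Likelihood: product over the training set of h^y (1-h)^(1-y).
likelihood = np.prod(h**y * (1 - h)**(1 - y))

# Log-likelihood: the product becomes a sum of logarithms.
log_likelihood = np.sum(y * np.log(h) + (1 - y) * np.log(1 - h))

assert np.isclose(np.log(likelihood), log_likelihood)
```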
The maximum likelihood estimator for \(\v\theta\) is then given by:
\[
\hat{\v\theta} = \arg\max_{\v\theta} \ell(\v\theta)
= \arg\min_{\v\theta} \bigl(-\log \ell(\v\theta)\bigr).
\]
Finding the optimal \(\v\theta\) has to be done with a numerical technique: unlike the case of linear regression, there is no analytical solution for logistic regression. We thus have to calculate the gradient of \(-\log\ell(\v\theta)\) and then iterate the gradient descent steps, as we have done for linear regression.
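The whole procedure can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the definitive implementation: the data is randomly generated here, \(h_{\v\theta}(\v x)\) is taken to be the sigmoid of \(\v\theta^{T}\v x\), and the gradient \(\sum_{i}\bigl(h_{\v\theta}(\v x\ls i)-y\ls i\bigr)\v x\ls i\) is the standard result whose derivation is the subject of the next subsection.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def neg_log_likelihood(theta, X, y):
    """-log l(theta) for the Bernoulli model with h_theta = sigmoid(X theta)."""
    h = sigmoid(X @ theta)
    return -np.sum(y * np.log(h) + (1 - y) * np.log(1 - h))

def gradient(theta, X, y):
    """Gradient of -log l(theta): X^T (h_theta(X) - y) (derived in the next subsection)."""
    return X.T @ (sigmoid(X @ theta) - y)

def gradient_descent(X, y, lr=0.01, n_iters=500):
    theta = np.zeros(X.shape[1])
    for _ in range(n_iters):
        theta -= lr * gradient(theta, X, y)
    return theta

# Toy 1D data with a bias column (randomly generated for illustration).
rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
X = np.column_stack([np.ones(100), x1])
y = (x1 > 0).astype(float)

theta = gradient_descent(X, y)
# The descent should have reduced -log l relative to the starting point theta = 0.
assert neg_log_likelihood(theta, X, y) < neg_log_likelihood(np.zeros(2), X, y)
```

Note that the loop uses a fixed learning rate and iteration count; in practice one would monitor \(-\log\ell(\v\theta)\) and stop when it no longer decreases appreciably.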
You may skip the next subsection and take the gradient derivation for granted.