5.1.1. Univariate Linear Regression

5.1.1.1. One Learning Example at a Time

Let \((x\ls i, y\ls i)\) for \(i=1,\ldots,m\) be the set of points through which we want to fit a straight line. The hypothesis or model function in this case is:

\[h_{\v\theta}(x) = \theta_0 + \theta_1 x\]

Here the parameter vector \(\v\theta\) is given by:

\[\begin{split}\v\theta = \matvec{c}{\theta_0 \\ \theta_1}\end{split}\]

Fitting a straight line then amounts to finding the “optimal” parameters \(\theta_0\) and \(\theta_1\) such that \(y\ls i \approx h_{\v\theta}(x\ls i)\) for all \(i\). Optimal parameters are found by minimizing a cost function:

\[J(\v\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_{\v\theta}(x\ls i) - y\ls i \right)^2\]

The optimal parameter vector is then given by:

\[\v\theta^\star = \arg\min_{\v\theta} J(\v\theta)\]

All we have to do from here is find the \(\v\theta\) that minimizes the cost function. For this particular cost function the unique optimal parameter vector can be calculated analytically (we will do so in a later section). Here we use a numerical iterative algorithm to ‘search’ for the optimal parameter vector. There are many numerical optimization algorithms; we discuss only one, the Gradient Descent Algorithm, which in its (over)simplified form can be written as:

\[\text{iterate until done:}\qquad \v\theta := \v\theta - \alpha \frac{\partial J(\v\theta)}{\partial \v\theta}\]

Evidently the stopping criterion is ill defined here, and it is hard to define such a criterion in general. Often the best approach is to observe the value of \(J\) while iterating and to set a maximum on the number of iterations.
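Such a criterion can be sketched in code. The sketch below (Python with NumPy; the function names, step size and tolerance are illustrative choices, not fixed by these notes) caps the number of iterations and stops early once \(J\) barely decreases:

```python
import numpy as np

def minimize_gd(J, grad_J, theta0, alpha=0.1, max_iters=10_000, tol=1e-10):
    """Gradient descent that stops when J decreases by less than `tol`
    between iterations, or after `max_iters` steps."""
    theta = np.asarray(theta0, dtype=float)
    J_prev = J(theta)
    for _ in range(max_iters):
        theta = theta - alpha * grad_J(theta)   # gradient descent step
        J_new = J(theta)
        if J_prev - J_new < tol:                # J barely decreases: stop
            break
        J_prev = J_new
    return theta
```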

The gradient vector can be easily calculated to be:

\[\begin{split}\frac{\partial J(\v\theta)}{\partial \v\theta} = \matvec{c}{ \frac{\partial J(\v\theta)}{\partial \theta_0} \\ \frac{\partial J(\v\theta)}{\partial \theta_1}} = \frac{1}{m}\sum_{i=1}^{m} \left( h_{\v\theta}(x\ls i) - y\ls i \right) \matvec{c}{1\\ x\ls i}\end{split}\]

We can tidy up this equation a bit by defining the augmented input vector:

\[\begin{split}\hv x\ls i = \matvec{c}{1\\x\ls i}\end{split}\]

resulting in:

\[\frac{\partial J(\v\theta)}{\partial \v\theta} = \frac{1}{m}\sum_{i=1}^{m} \left( \v\theta\T \hv x\ls i - y\ls i \right)\hv x\ls i\]

The algorithm then is:

\[\text{iterate until done:}\qquad \v\theta := \v\theta - \alpha \frac{1}{m}\sum_{i=1}^{m} \left( \v\theta\T \hv x\ls i - y\ls i \right)\hv x\ls i\]
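As a sketch, this algorithm can be implemented directly with an explicit loop over the \(m\) examples (Python with NumPy; the names and the default values of \(\alpha\) and the iteration count are illustrative assumptions):

```python
import numpy as np

def fit_line(x, y, alpha=0.1, n_iters=2000):
    """Gradient descent with an explicit sum over the m learning examples."""
    m = len(x)
    theta = np.zeros(2)                       # theta = (theta_0, theta_1)
    for _ in range(n_iters):
        grad = np.zeros(2)
        for i in range(m):
            x_hat = np.array([1.0, x[i]])     # augmented example (1, x^(i))
            grad += (theta @ x_hat - y[i]) * x_hat
        theta -= alpha * grad / m             # gradient descent step
    return theta
```

For points lying exactly on \(y = 1 + 2x\) this recovers \(\theta_0 \approx 1\) and \(\theta_1 \approx 2\).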

Note that in the formulation given above the entire learning set (i.e. all \(m\) examples) is used to calculate every step of the gradient descent procedure. Later in these notes, when discussing neural networks, we will restrict the calculation of the gradient to a subset of all examples (in the extreme case only one example is used to calculate the step to be made). This type of learning is called stochastic gradient descent.
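A minimal sketch of the stochastic variant, updating \(\v\theta\) with one example at a time (again in Python/NumPy; all names, the step size, and the epoch count are illustrative, not taken from these notes):

```python
import numpy as np

def fit_line_sgd(x, y, alpha=0.05, n_epochs=500, seed=0):
    """Stochastic gradient descent: one example per update instead of
    the full sum over all m examples."""
    rng = np.random.default_rng(seed)
    m = len(x)
    theta = np.zeros(2)
    for _ in range(n_epochs):
        for i in rng.permutation(m):          # visit examples in random order
            x_hat = np.array([1.0, x[i]])     # augmented example (1, x^(i))
            theta -= alpha * (theta @ x_hat - y[i]) * x_hat
    return theta
```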

5.1.1.2. Vectorization of the Learning Set

We consider the same problem as in the previous section, i.e. minimization of the cost function:

\[\begin{split}J(\v\theta) &= \frac{1}{2m} \sum_{i=1}^{m} \left( h_{\v\theta}(x\ls i) - y\ls i \right)^2\\ &= \frac{1}{2m} \sum_{i=1}^{m} \left( \v \theta\T \hv x\ls i - y\ls i \right)^2\end{split}\]

Let the vector \(\v y\) contain all target values from the learning set:

\[\begin{split}\v y = \matvec{c}{y \ls 1\\ \vdots \\ y \ls m}\end{split}\]

and let \(\hv X\) be the matrix

\[\begin{split}\hv X = \matvec{c}{\hv x \ls 1 \T \\ \vdots \\ \hv x \ls m \T }\end{split}\]

Note that the example vectors are now the rows in the matrix. Now the expression for the cost function can be written as:

\[J(\v \theta) = \frac{1}{2m} \| \hv X \v \theta - \v y \|^2\]

The gradient of the cost function using an explicit summation over the learning examples was derived to be:

\[\frac{\partial J(\v\theta)}{\partial \v\theta} = \frac{1}{m}\sum_{i=1}^{m} \left( \v\theta\T \hv x\ls i - y\ls i \right)\hv x\ls i\]

In vectorized form we have:

\[\frac{\partial J(\v\theta)}{\partial \v\theta} = \frac{1}{m} \hv X\T (\hv X \v\theta - \v y)\]

Leading to the gradient descent algorithm:

\[\text{iterate until done:}\qquad \v\theta := \v\theta - \alpha \frac{1}{m} \hv X\T (\hv X \v\theta - \v y)\]
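The vectorized cost and update translate almost line by line into NumPy (a sketch; the function names and default hyperparameters are illustrative assumptions):

```python
import numpy as np

def cost(theta, X_hat, y):
    """J(theta) = (1/2m) || X_hat theta - y ||^2"""
    m = len(y)
    return np.sum((X_hat @ theta - y) ** 2) / (2 * m)

def fit_line_vec(X_hat, y, alpha=0.1, n_iters=2000):
    """Vectorized gradient descent:
    theta := theta - (alpha/m) X_hat^T (X_hat theta - y)"""
    m, n = X_hat.shape
    theta = np.zeros(n)
    for _ in range(n_iters):
        theta -= (alpha / m) * (X_hat.T @ (X_hat @ theta - y))
    return theta
```

Here `X_hat` has the augmented example vectors as its rows, e.g. `X_hat = np.column_stack([np.ones(len(x)), x])` for the univariate case above.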