5.1.5. Linear Regression from Basic Principles

5.1.5.1. A Statistical View on Regression

We consider a system with scalar input \(x\) and scalar output \(y\). We hypothesize that in the absence of ‘noise’ we have \(y = m(x)\). The exact analytical form of the function \(m\) is not known; all we have are examples \((x\ls i,y\ls i)\). Unfortunately, due to noise, in general \(y\ls i \not= m(x\ls i)\). We assume that the noise is iid for all \(x\) and normally distributed. This makes our observation a stochastic variable:

\[Y = m(x) + R\]

where \(R\) is the noise random variable. We assume zero-mean noise, \(R\sim\Normal(0,\sigma^2)\). Thus

\[Y \sim \Normal( m(x), \sigma^2 )\]

The goal of linear regression is to find an expression for the function \(m\). To make this mathematically tractable we approximate \(m\) with a parameterized hypothesis function \(h_{\v\theta}\), arriving at:

\[Y \sim \Normal( h_{\v\theta}(x), \sigma^2 )\]

With our learning set \((x\ls i,y\ls i)\) for \(i=1,\ldots,m\) the regression task is to find the parameter vector \(\v\theta\) that best ‘fits’ the data.
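As a concrete illustration of this statistical model, the following minimal Python sketch assumes a linear hypothesis \(h_{\v\theta}(x) = \theta_0 + \theta_1 x\) and draws noisy observations from \(\Normal(h_{\v\theta}(x), \sigma^2)\); the parameter values, the number of examples, and the input range are made-up choices for illustration only.

```python
import numpy as np

# Minimal sketch: draw observations from Y ~ N(h_theta(x), sigma^2),
# assuming the linear hypothesis h_theta(x) = theta0 + theta1 * x.
# All parameter values below are made up for illustration.
rng = np.random.default_rng(0)

theta0, theta1 = 1.0, 2.0   # assumed 'true' parameters of h_theta
sigma = 0.5                 # standard deviation of the noise R

m = 50                      # number of examples in the learning set
x = rng.uniform(0.0, 5.0, size=m)
y = theta0 + theta1 * x + rng.normal(0.0, sigma, size=m)   # y_i = h_theta(x_i) + r_i
```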

5.1.5.2. Maximum Likelihood Estimator

We are looking for the parameter vector \(\v\theta\) that makes the observed data most plausible. This can be cast in the form of a maximum likelihood estimator (MLE). Let \(f\) be the joint probability density function of all observations \(y\ls i\) given the \(x\ls i\) values and the parameter vector \(\v\theta\). The MLE then can be written as:

\[\hat{\v\theta} = \arg\max_{\v\theta} f(y\ls 1, \ldots,y\ls m \bigm| x\ls 1, \ldots, x\ls m; \v\theta)\]

As we have assumed that the noise in the observations is iid and normally distributed, we can write:

\[\begin{split}f(y\ls 1, \ldots,y\ls m \bigm| x\ls 1, \ldots, x\ls m; \v\theta) &= \prod_{i=1}^{m} f(y\ls i\bigm| x\ls i; \v\theta)\\ &= \prod_{i=1}^{m} \frac{1}{\sigma\sqrt{2\pi}} \exp\left( - \frac{(y\ls i - h_{\v\theta}(x\ls i))^2}{2\sigma^2}\right)\end{split}\]
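As a sanity check of this factorization, the sketch below evaluates the joint density numerically for a candidate \(\v\theta\), again assuming the illustrative linear hypothesis \(h_{\v\theta}(x) = \theta_0 + \theta_1 x\); the function name and the example numbers are made up.

```python
import numpy as np

def joint_density(theta, x, y, sigma):
    """Joint density f(y_1, ..., y_m | x_1, ..., x_m; theta) under iid Gaussian
    noise, assuming the linear hypothesis h_theta(x) = theta[0] + theta[1]*x."""
    residuals = y - (theta[0] + theta[1] * x)
    per_sample = np.exp(-residuals**2 / (2 * sigma**2)) / (sigma * np.sqrt(2 * np.pi))
    return np.prod(per_sample)   # product of the m per-example densities

# Made-up example data and parameters for illustration.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.1, 2.9, 5.2, 6.8])
print(joint_density(np.array([1.0, 2.0]), x, y, sigma=0.5))
```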

Our goal is to maximize this probability density. Because the logarithm is a monotonically increasing function, we may equivalently maximize its logarithm. That function of \(\v\theta\) is called the log likelihood:

\[\begin{split}\ell(\v\theta) &= \log\left( \prod_{i=1}^{m} \frac{1}{\sigma\sqrt{2\pi}} \exp\left( - \frac{(y\ls i - h_{\v\theta}(x\ls i))^2}{2\sigma^2}\right) \right)\\ &= \sum_{i=1}^{m} \left( -\log(\sigma\sqrt{2\pi}) - \frac{(y\ls i - h_{\v\theta}(x\ls i))^2}{2\sigma^2} \right)\end{split}\]
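The term \(-\log(\sigma\sqrt{2\pi})\) does not depend on \(\v\theta\), so the log likelihood can be written as

\[\ell(\v\theta) = -m\log(\sigma\sqrt{2\pi}) - \frac{1}{2\sigma^2} \sum_{i=1}^{m} \left( y\ls i - h_{\v\theta}(x\ls i) \right)^2\]

where the first term and the positive factor \(1/(2\sigma^2)\) have no influence on the location of the maximum.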

Maximizing \(\ell\) is therefore equivalent to minimizing the sum of squared errors:

\[\hat{\v\theta} = \arg\min_{\v\theta} \sum_{i=1}^{m} \left( y\ls i - h_{\v\theta}(x\ls i) \right)^2\]

This is exactly the same expression we started with when introducing linear regression. This section shows that the least-squares estimator coincides with the maximum likelihood estimator when the noise is iid and normally distributed. In practice this assumption rarely holds exactly (especially in a machine learning context), but linear regression is used nonetheless, usually with the addition of a regularization term.
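To close the loop, here is a minimal sketch of the resulting least-squares fit, again assuming the illustrative linear hypothesis \(h_{\v\theta}(x) = \theta_0 + \theta_1 x\); the synthetic data and all parameter values are assumptions made for the example, not part of the derivation above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data from an assumed 'true' linear model (illustration only).
m = 100
x = rng.uniform(0.0, 5.0, size=m)
y = 1.0 + 2.0 * x + rng.normal(0.0, 0.5, size=m)

# Least-squares fit: minimize sum_i (y_i - theta0 - theta1 * x_i)^2.
# With the design matrix X = [1, x] this is a linear least-squares problem.
X = np.column_stack([np.ones(m), x])
theta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

print(theta_hat)   # expected to be close to the assumed parameters (1.0, 2.0)
```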