Consider the squared error

\[\epsilon = \sum_{i=1}^N e_i^2 = \| \v e \|^2 = \v e\T\v e\]

i.e.

\[\epsilon = \v e\T\v e = (\v f-\v X\v p)\T(\v f-\v X\v p)= \v f\T\v f - 2\v p\T\v X\T\v f+\v p\T\v X\T\v X\v p\]

The goal is to find the parameter vector \(\v p^\star\) that minimizes the above expression. A necessary condition is that all derivatives \(\partial \epsilon / \partial p_i\) for \(i=1,\ldots,n\) are zero (this condition is also sufficient here because \(\epsilon\) is a convex quadratic function of \(\v p\)).
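To make the expansion above concrete, here is a minimal numerical sketch (using NumPy with arbitrary synthetic values for \(\v X\), \(\v f\) and \(\v p\), none of which come from the text) checking that the direct and the expanded expression for \(\epsilon\) agree:

```python
import numpy as np

rng = np.random.default_rng(0)
N, n = 6, 3                       # number of data points and parameters (illustrative)
X = rng.standard_normal((N, n))   # design matrix
f = rng.standard_normal(N)        # observed values
p = rng.standard_normal(n)        # an arbitrary parameter vector

e = f - X @ p                     # error vector
eps_sum  = np.sum(e**2)                                    # sum of squared errors
eps_quad = f @ f - 2 * p @ (X.T @ f) + p @ (X.T @ X @ p)   # expanded quadratic form

print(np.allclose(eps_sum, eps_quad))   # True: both expressions agree
```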

Note that the first term \(\v f\T\v f\) does not depend on the parameter vector and thus disappears in all derivatives. For ease of notation let’s write \(\v v = \v X\T \v f\) and \(\v A = \v X\T \v X\); then, by straightforward use of these definitions, we have:

\[\v p\T\v v = \sum_{j=1}^n p_j v_j\]

and

\[\v p\T \v A \v p = \sum_{k=1}^n \sum_{l=1}^n A_{k,l} p_k p_l\]

leading to

\[\frac{\partial\epsilon(p_1,\ldots,p_n)}{\partial p_i} = -2\frac{\partial}{\partial p_i}\left(\sum_{j=1}^n v_j p_j\right) + \frac{\partial}{\partial p_i}\left( \sum_{k=1}^n \sum_{l=1}^n A_{k,l} p_k p_l \right)\]

Consider the first differentiation:

\[\frac{\partial}{\partial p_i}\left(\sum_{j=1}^n v_j p_j\right) = v_i\]

For the second differentiation we have:

\[\begin{split}\frac{\partial}{\partial p_i}\left( \sum_{k=1}^n \sum_{l=1}^n A_{k,l} p_k p_l \right) &= \sum_{k=1}^n A_{k,i} p_k + \sum_{l=1}^n A_{i,l} p_l\\ &= \sum_{k=1}^n (A_{k,i}+A_{i,k}) p_k \\ &= \sum_{k=1}^n (A\T_{i,k}+A_{i,k}) p_k \\ &= \left((\v A+\v A\T)\v p\right)_i\end{split}\]
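As a small worked check, take \(n=2\) and write out the quadratic form explicitly:

\[\v p\T \v A \v p = A_{1,1} p_1^2 + A_{1,2} p_1 p_2 + A_{2,1} p_2 p_1 + A_{2,2} p_2^2\]

so that, for example,

\[\frac{\partial}{\partial p_1}\left(\v p\T \v A \v p\right) = 2 A_{1,1} p_1 + (A_{1,2}+A_{2,1}) p_2 = \left((\v A+\v A\T)\v p\right)_1\]

in agreement with the general expression above.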

And thus

\[\frac{\partial\epsilon}{\partial p_i} = -2 v_i + \left((\v A+\v A\T)\v p\right)_i\]

Now, with the definition of the derivative of a scalar function with respect to a vector, we get:

\[\frac{\partial\epsilon}{\partial \v p} = -2 \v v + (\v A+\v A\T)\v p\]

Since \(\v A = \v X\T\v X\) is symmetric, we have \(\v A+\v A\T = 2\v A\). Substituting the definitions of \(\v v\) and \(\v A\) again we get

\[\frac{\partial\epsilon}{\partial \v p} = -2 \v X\T \v f + 2\v X\T\v X \v p\]
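As a sanity check, the following sketch (again with arbitrary synthetic data, not taken from the text) compares this analytic gradient with a central finite-difference approximation, and verifies that the gradient vanishes at the least-squares minimizer \(\v p^\star\):

```python
import numpy as np

rng = np.random.default_rng(1)
N, n = 8, 3
X = rng.standard_normal((N, n))   # design matrix (illustrative)
f = rng.standard_normal(N)        # observations (illustrative)
p = rng.standard_normal(n)        # evaluation point for the gradient

def eps(p):
    """Squared error eps(p) = ||f - X p||^2."""
    e = f - X @ p
    return e @ e

# Analytic gradient derived above: -2 X^T f + 2 X^T X p
grad_analytic = -2 * X.T @ f + 2 * X.T @ X @ p

# Central finite-difference approximation of the same gradient
h = 1e-6
I = np.eye(n)
grad_fd = np.array([(eps(p + h * I[i]) - eps(p - h * I[i])) / (2 * h) for i in range(n)])

print(np.allclose(grad_analytic, grad_fd, atol=1e-4))        # True

# At the least-squares minimizer p* the gradient vanishes
p_star, *_ = np.linalg.lstsq(X, f, rcond=None)
print(np.allclose(-2 * X.T @ f + 2 * X.T @ X @ p_star, 0))   # True
```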