Consider the squared error
\[\epsilon = \sum_{i=1}^N e_i^2 = \| \v e \|^2 =
\v e\T\v e\]
i.e.
\[\epsilon = \v e\T\v e =
(\v f-\v X\v p)\T(\v f-\v X\v p)= \v f\T\v f -
2\v p\T\v X\T\v f+\v p\T\v X\T\v X\v p\]
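As a quick numerical sanity check of this expansion, the sketch below compares the direct squared error with the expanded quadratic form. It uses NumPy; the array names, sizes, and random data are illustrative assumptions, not taken from the text.

```python
# Minimal sketch (illustrative data): verify that
# ||f - X p||^2 == f'f - 2 p'X'f + p'X'X p  numerically.
import numpy as np

rng = np.random.default_rng(0)
N, n = 20, 3
X = rng.normal(size=(N, n))   # design matrix (assumed shape)
f = rng.normal(size=N)        # observations
p = rng.normal(size=n)        # arbitrary parameter vector

e = f - X @ p
eps_direct = e @ e                                        # ||f - X p||^2
eps_expanded = f @ f - 2 * p @ (X.T @ f) + p @ (X.T @ X @ p)
print(np.isclose(eps_direct, eps_expanded))               # True
```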
The goal is to find the parameter vector \(\v p^\star\) that minimizes
the above expression. A necessary condition is that all derivatives
\(\partial \epsilon / \partial p_i\) for \(i=1,\ldots,n\) are zero (it is
also a sufficient condition because the error function is a convex
quadratic in \(\v p\)).
Note that the first term \(\v f\T\v f\) does not depend on the
parameter vector and thus vanishes in all derivatives. For ease of
notation let’s write \(\v v = \v X\T \v f\) and \(\v A = \v X\T \v X\); then
by straightforward use of the definitions we have:
\[\v p\T\v v = \sum_{j=1}^n p_j v_j\]
and
\[\v p\T A \v p = \sum_{k=1}^n \sum_{l=1}^n A_{k,l} p_k p_l\]
leading to
\[\frac{\partial\epsilon(p_1,\ldots,p_n)}{\partial p_i} =
-2\frac{\partial}{\partial p_i}\left(\sum_{j=1}^n v_j p_j\right) +
\frac{\partial}{\partial p_i}\left( \sum_{k=1}^n \sum_{l=1}^n A_{k,l} p_k p_l \right)\]
Consider the first differentiation:
\[\frac{\partial}{\partial p_i}\left(\sum_{j=1}^n v_j p_j\right) =
v_i\]
For the second differentiation we have:
\[\begin{split}\frac{\partial}{\partial p_i}\left( \sum_{k=1}^n \sum_{l=1}^n A_{k,l} p_k p_l \right)
&= \sum_{k=1}^n A_{k,i} p_k + \sum_{l=1}^n A_{i,l} p_l\\
&= \sum_{k=1}^n (A_{k,i}+A_{i,k}) p_k \\
&= \sum_{k=1}^n \left(A_{k,i}+(\v A\T)_{k,i}\right) p_k \\
&= \left((\v A+\v A\T)\v p\right)_i\end{split}\]
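The identity derived above, namely that the gradient of \(\v p\T\v A\v p\) is \((\v A+\v A\T)\v p\), can also be checked numerically. The sketch below does so with central finite differences; the matrix, vector, and step size are illustrative assumptions, not taken from the text.

```python
# Minimal sketch (illustrative data): compare the analytic gradient
# (A + A^T) p of the quadratic form p' A p with finite differences.
import numpy as np

rng = np.random.default_rng(1)
n = 4
A = rng.normal(size=(n, n))          # deliberately non-symmetric test matrix
p = rng.normal(size=n)

analytic = (A + A.T) @ p

h = 1e-6                             # finite-difference step (assumed value)
numeric = np.empty(n)
for i in range(n):
    dp = np.zeros(n)
    dp[i] = h
    numeric[i] = ((p + dp) @ A @ (p + dp) - (p - dp) @ A @ (p - dp)) / (2 * h)

print(np.allclose(analytic, numeric))   # True
```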
And thus
\[\frac{\partial\epsilon}{\partial p_i} = -2 v_i + \left((\v A+\v A\T)\v p\right)_i\]
Now, using the definition of the derivative of a scalar function with
respect to a vector, we get:
\[\frac{\partial\epsilon}{\partial \v p} = -2 \v v + (\v A+\v A\T)\v p\]
Substituting the definitions of \(\v v\) and \(\v A\) back in, and noting that \(\v A = \v X\T\v X\) is symmetric so that \(\v A+\v A\T = 2\v X\T\v X\), we get
\[\frac{\partial\epsilon}{\partial \v p} = -2 \v X\T \v f + 2\v X\T\v X \v p\]
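Setting this gradient to zero at \(\v p^\star\) yields the normal equations \(\v X\T\v X\,\v p^\star = \v X\T\v f\). The sketch below solves them and compares the result with NumPy's built-in least-squares solver; the data and sizes are illustrative assumptions, not taken from the text.

```python
# Minimal sketch (illustrative data): solve the normal equations
# X'X p = X'f obtained by setting the gradient to zero, and compare
# with numpy.linalg.lstsq as a reference.
import numpy as np

rng = np.random.default_rng(2)
N, n = 50, 3
X = rng.normal(size=(N, n))
f = rng.normal(size=N)

p_normal = np.linalg.solve(X.T @ X, X.T @ f)       # normal equations
p_lstsq, *_ = np.linalg.lstsq(X, f, rcond=None)    # reference solution

print(np.allclose(p_normal, p_lstsq))              # True
```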