Consider the squared error

.. math:: \epsilon = \sum_{i=1}^N e_i^2 = \| \v e \|^2 = \v e\T\v e

i.e.

.. math:: \epsilon = \v e\T\v e = (\v f-\v X\v p)\T(\v f-\v X\v p) = \v f\T\v f - 2\v p\T\v X\T\v f + \v p\T\v X\T\v X\v p

The goal is to find the parameter vector $\v p^\star$ that minimizes the above expression. A necessary condition is that all derivatives $\partial \epsilon / \partial p_i$ for $i=1,\ldots,n$ are zero (it is also a sufficient condition due to the quadratic nature of the error function). Note that the first term $\v f\T\v f$ does not depend on the parameter vector and thus disappears in all derivatives.

For ease of notation let's write $\v v = \v X\T\v f$ and $\v A = \v X\T\v X$; then, by straightforward use of the definitions, we have

.. math:: \v p\T\v v = \sum_{j=1}^n p_j v_j

and

.. math:: \v p\T\v A\v p = \sum_{k=1}^n \sum_{l=1}^n A_{k,l} p_k p_l

leading to

.. math:: \frac{\partial\epsilon(p_1,\ldots,p_n)}{\partial p_i} = -2\frac{\partial}{\partial p_i}\left(\sum_{j=1}^n v_j p_j\right) + \frac{\partial}{\partial p_i}\left( \sum_{k=1}^n \sum_{l=1}^n A_{k,l} p_k p_l \right)

Consider the first differentiation:

.. math:: \frac{\partial}{\partial p_i}\left(\sum_{j=1}^n v_j p_j\right) = v_i

For the second differentiation we have:

.. math::
   \frac{\partial}{\partial p_i}\left( \sum_{k=1}^n \sum_{l=1}^n A_{k,l} p_k p_l \right) &= \sum_{k=1}^n A_{k,i} p_k + \sum_{l=1}^n A_{i,l} p_l\\
   &= \sum_{k=1}^n (A_{k,i}+A_{i,k}) p_k \\
   &= \sum_{k=1}^n (A_{k,i}+A\T_{k,i}) p_k \\
   &= \left((\v A+\v A\T)\v p\right)_i

And thus

.. math:: \frac{\partial\epsilon}{\partial p_i} = -2 v_i + \left((\v A+\v A\T)\v p\right)_i

Now with the definition for differentiating a scalar function with respect to a vector we get:

.. math:: \frac{\partial\epsilon}{\partial \v p} = -2 \v v + (\v A+\v A\T)\v p

Substituting the definitions of $\v v$ and $\v A$ again, and using that $\v A = \v X\T\v X$ is symmetric (so $\v A+\v A\T = 2\v X\T\v X$), we get

.. math:: \frac{\partial\epsilon}{\partial \v p} = -2 \v X\T \v f + 2\v X\T\v X \v p
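The gradient derived above can be checked numerically. The sketch below, with an illustrative quadratic model and made-up data (the names ``X``, ``f``, ``p`` mirror the symbols in the derivation but the specific example is an assumption), compares the closed-form gradient $-2\v X\T\v f + 2\v X\T\v X\v p$ against a central finite-difference approximation of $\partial\epsilon/\partial p_i$, and verifies that setting the gradient to zero yields the normal equations $\v X\T\v X\,\v p^\star = \v X\T\v f$::

.. code-block:: python

    import numpy as np

    # Illustrative setup: fit f(x) = p0 + p1*x + p2*x^2 to noisy samples.
    rng = np.random.default_rng(0)
    x = np.linspace(0, 1, 20)
    X = np.vander(x, 3, increasing=True)   # design matrix, shape (N, n)
    f = 1 + 2 * x - 3 * x**2 + 0.01 * rng.standard_normal(x.size)

    def epsilon(p):
        """Squared error eps = (f - X p)^T (f - X p)."""
        e = f - X @ p
        return e @ e

    p = rng.standard_normal(3)             # arbitrary parameter vector

    # Closed-form gradient from the derivation: -2 X^T f + 2 X^T X p
    grad = -2 * X.T @ f + 2 * X.T @ X @ p

    # Central finite differences of epsilon along each coordinate axis
    h = 1e-6
    num = np.array([(epsilon(p + h * d) - epsilon(p - h * d)) / (2 * h)
                    for d in np.eye(3)])
    assert np.allclose(grad, num, atol=1e-4)

    # Gradient = 0 gives the normal equations X^T X p* = X^T f;
    # lstsq solves the same least-squares problem in a stable way.
    p_star, *_ = np.linalg.lstsq(X, f, rcond=None)
    assert np.allclose(X.T @ X @ p_star, X.T @ f)

Because $\epsilon$ is quadratic in $\v p$, the central difference is essentially exact here, so the two gradients agree to floating-point accuracy.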