1.5. Matrix Calculus

Optimization is essential in machine learning. Almost all machine learning algorithms come down to the optimization of a scalar loss (or cost) function with respect to an input vector \(\v x\) or a parameter vector \(\v p\).

Things become more complicated when we differentiate all elements of a vector with respect to all elements of another vector, or when matrices are involved as well.

Matrix calculus provides the tools to elegantly deal with these derivatives.

This section is based on the Wikipedia article on matrix calculus. We choose the so-called denominator layout, as explained in that article.

1.5.1. The derivative of a scalar function with respect to a vector

Let \(y\) be a scalar function of all elements \(x_i\) in vector \(\v x\). By definition we state:

\[\begin{split}\pfrac{y}{\v x} = \begin{pmatrix} \pfrac{y}{x_1}\\ \vdots \\ \pfrac{y}{x_n} \end{pmatrix}\end{split}\]

The derivative of such a scalar function is often called the gradient of the function.
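
As a quick numerical illustration, the gradient can be approximated with central differences and stacked into a column, one partial derivative per element of \(\v x\). This is a minimal NumPy sketch of our own; the helper name numerical_gradient and the example function \(y = \v x\T\v x\) are not part of the text above:

    import numpy as np

    def numerical_gradient(f, x, eps=1e-6):
        """Central-difference estimate of the partial derivatives of a scalar f at x."""
        grad = np.zeros_like(x)
        for i in range(x.size):
            e = eps * np.eye(x.size)[i]                    # step along the i-th coordinate
            grad[i] = (f(x + e) - f(x - e)) / (2 * eps)
        return grad

    x = np.array([1.0, 2.0, 3.0])
    f = lambda v: v @ v                                    # y = x^T x, a scalar function of x
    print(numerical_gradient(f, x))                        # approximately 2*x = [2. 4. 6.]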

1.5.2. The derivative of a vector with respect to a scalar

Let \(\v y\) be an \(m\) dimensional vector function of the scalar \(x\), then:

\[\pfrac{\v y}{x} = \begin{pmatrix} \pfrac{y_1}{x} & \cdots & \pfrac{y_m}{x} \end{pmatrix}\]

1.5.3. The derivative of a vector with respect to a vector

Now let \(\v y\) be an \(m\) dimensional vector function of the \(n\) dimensional vector \(\v x\), then by definition:

\[\begin{split}\pfrac{\v y}{\v x} = \begin{pmatrix} \pfrac{y_1}{x_1} & \cdots & \pfrac{y_m}{x_1}\\ \vdots & \ddots & \vdots\\ \pfrac{y_1}{x_n} & \cdots & \pfrac{y_m}{x_n} \end{pmatrix}\end{split}\]
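
The layout convention is easy to check numerically. Below is a small NumPy sketch of our own (the helper numerical_jacobian and the chosen \(\v y(\v x)\) are arbitrary) showing that in denominator layout the result has \(n\) rows, one per element of \(\v x\), and \(m\) columns, one per element of \(\v y\):

    import numpy as np

    def numerical_jacobian(f, x, eps=1e-6):
        """Denominator-layout Jacobian: element (i, j) approximates dy_j / dx_i."""
        J = np.zeros((x.size, f(x).size))
        for i in range(x.size):
            e = eps * np.eye(x.size)[i]                    # step along the i-th coordinate
            J[i, :] = (f(x + e) - f(x - e)) / (2 * eps)
        return J

    x = np.array([1.0, 2.0, 3.0])                          # x in R^3, so n = 3
    y = lambda v: np.array([v[0] * v[1], np.sin(v[2])])    # y(x) in R^2, so m = 2
    print(numerical_jacobian(y, x).shape)                  # (3, 2): n rows, m columns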

1.5.4. Some important derivatives in the machine learning context

For \(A\) not a function of \(\v x\):

(1.5.1)\[\pfrac{A\v x}{\v x} = A\T\]
(1.5.2)\[\pfrac{\v x\T A}{\v x} = A\]
(1.5.3)\[\pfrac{\v x\T A \v x}{\v x} = (A+A\T)\v x\]
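
These identities can be verified with finite differences. A minimal NumPy sketch of our own, with randomly chosen \(A\), \(S\) and \(\v x\) (Eq.1.5.2 follows analogously and is not repeated here):

    import numpy as np

    rng = np.random.default_rng(0)
    A = rng.normal(size=(4, 3))            # constant matrix, not a function of x
    S = rng.normal(size=(3, 3))            # square matrix for the quadratic form
    x = rng.normal(size=3)
    eps = 1e-6
    E = eps * np.eye(3)                    # rows are scaled basis vectors for central differences

    # Eq.1.5.1: row i of the denominator-layout derivative of A x holds d(Ax)/dx_i
    J = np.array([(A @ (x + e) - A @ (x - e)) / (2 * eps) for e in E])
    print(np.allclose(J, A.T))             # True

    # Eq.1.5.3: gradient of the scalar x^T S x
    q = lambda v: v @ S @ v
    g = np.array([(q(x + e) - q(x - e)) / (2 * eps) for e in E])
    print(np.allclose(g, (S + S.T) @ x))   # True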

Let \(\v u\) be a vector function such that \(\v x\in\setR^n\mapsto\v u(\v x)\in\setR^m\), then:

(1.5.4)\[\pfrac{A\v u}{\v x} = \pfrac{\v u}{\v x} A\T\]

or more generally

\[\pfrac{\v f(\v u)}{\v x} = \pfrac{\v u}{\v x} \pfrac{\v f(\v u)}{\v u}\]

where \(\v f\) is a vector-valued function. Note that the choice \(\v f(\v u)=A\v u\) together with Eq.1.5.1 leads to the result in Eq.1.5.4.
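
A quick numerical check of Eq.1.5.4, sketched with an arbitrarily chosen nonlinear \(\v u\) and a random constant \(A\) (the helper numerical_jacobian is our own):

    import numpy as np

    def numerical_jacobian(f, x, eps=1e-6):
        # the same central-difference, denominator-layout Jacobian as sketched earlier
        J = np.zeros((x.size, f(x).size))
        for i in range(x.size):
            e = eps * np.eye(x.size)[i]
            J[i, :] = (f(x + e) - f(x - e)) / (2 * eps)
        return J

    rng = np.random.default_rng(1)
    A = rng.normal(size=(4, 3))                                     # constant matrix
    x = rng.normal(size=2)
    u = lambda v: np.array([v[0] ** 2, np.sin(v[1]), v[0] * v[1]])  # u: R^2 -> R^3

    lhs = numerical_jacobian(lambda v: A @ u(v), x)                 # d(Au)/dx, shape (2, 4)
    rhs = numerical_jacobian(u, x) @ A.T                            # (du/dx) A^T, shape (2, 4)
    print(np.allclose(lhs, rhs))                                    # True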

Let \(f\) be a scalar function, then with \(f\cdot(\v x)\) we denote the elementwise application of the function \(f\) to the vector \(\v x\):

\[\begin{split}f\cdot(\v x) = \begin{pmatrix} f(x_1)\\ \vdots \\ f(x_n)\end{pmatrix}\end{split}\]

We introduce this special notation to prevent confusion with a scalar function, say \(g\), that has a vector as argument and produces a scalar: \(g(\v x)\). Again let \(\v y = \v y(\v x)\), then:

(1.5.5)\[\begin{split}\pfrac{f\cdot(\v y)}{\v x} &= \pfrac{}{\v x}\begin{pmatrix} f(y_1)\\ \vdots \\ f(y_m)\end{pmatrix}\\ &= \begin{pmatrix} \pfrac{f(y_1)}{x_1} & \cdots & \pfrac{f(y_m)}{x_1}\\ \vdots & \ddots & \vdots\\ \pfrac{f(y_1)}{x_n} & \cdots & \pfrac{f(y_m)}{x_n} \end{pmatrix}\\ &= \begin{pmatrix} f'(y_1)\pfrac{y_1}{x_1} & \cdots & f'(y_m)\pfrac{y_m}{x_1}\\ \vdots & \ddots & \vdots\\ f'(y_1)\pfrac{y_1}{x_n} & \cdots & f'(y_m)\pfrac{y_m}{x_n} \end{pmatrix}\\ &= \pfrac{\v y}{\v x} \text{diag}(f'\cdot(\v y))\end{split}\]
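
Eq.1.5.5 can be checked numerically in the same way. A minimal sketch of our own, assuming \(f=\tanh\) (so that \(f'(v)=1-\tanh^2(v)\)) and a linear \(\v y(\v x)=B\v x\):

    import numpy as np

    rng = np.random.default_rng(2)
    B = rng.normal(size=(4, 3))
    x = rng.normal(size=3)
    eps = 1e-6
    E = eps * np.eye(3)

    # left-hand side: numerical derivative of f.(y(x)) with f = tanh and y(x) = B x
    lhs = np.array([(np.tanh(B @ (x + e)) - np.tanh(B @ (x - e))) / (2 * eps) for e in E])

    # right-hand side of Eq.1.5.5: dy/dx = B^T (by Eq.1.5.1) and f'(v) = 1 - tanh(v)^2
    rhs = B.T @ np.diag(1 - np.tanh(B @ x) ** 2)
    print(np.allclose(lhs, rhs))           # True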

1.5.5. Examples from Machine Learning

1.5.5.1. Linear Regression

The cost function in linear regression is

\[J(\v\theta) = \frac{1}{2m} \|\hv X \v\theta - \v y\|^2\]

The gradient function then is:

\[\begin{split}\pfrac{J(\v\theta)}{\v\theta} &= \frac{1}{2m}\pfrac{}{\v\theta} \left( (\hv X \v\theta - \v y)\T(\hv X \v\theta - \v y)\right)\\ &= \frac{1}{2m}\pfrac{}{\v\theta} \left( \v\theta\T\hv X\T\hv X \v\theta - 2 \v\theta\T\hv X\T\v y + \v y\T\v y \right)\\ &= \frac{1}{2m}\left( 2\hv X\T\hv X \v\theta - 2\hv X\T \v y\right) \\ &= \frac{1}{m} \hv X\T (\hv X\v\theta - \v y)\end{split}\]
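
This gradient is easy to verify numerically. A minimal NumPy sketch on toy data of our own, assuming \(\hv X\) denotes the design matrix with a leading column of ones:

    import numpy as np

    rng = np.random.default_rng(3)
    m, n = 50, 3
    X = np.hstack([np.ones((m, 1)), rng.normal(size=(m, n))])   # design matrix with a column of ones
    y = rng.normal(size=m)
    theta = rng.normal(size=n + 1)

    J = lambda t: np.sum((X @ t - y) ** 2) / (2 * m)            # the cost J(theta)
    grad = X.T @ (X @ theta - y) / m                            # the gradient derived above

    eps = 1e-6
    E = eps * np.eye(n + 1)
    num = np.array([(J(theta + e) - J(theta - e)) / (2 * eps) for e in E])
    print(np.allclose(grad, num))                               # True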

1.5.5.2. Logistic Regression

The cost function in logistic regression is:

\[J(\v\theta) = \frac{1}{m} \left( -\v y \T \log\cdot(g\cdot(\hv X\v\theta)) - (\v 1 - \v y)\T \log\cdot( \v 1 - g\cdot(\hv X \v\theta)) \right)\]

In calculating the gradient we first consider the term:

\[\pfrac{}{\v\theta} \v y\T \log\cdot(g\cdot(\hv X\v\theta)) = \pfrac{}{\v\theta} \v y\T f\cdot(\hv X \v\theta)\]

where we introduced \(f = \log \circ g\), the composition of \(\log\) after \(g\), i.e. \(f(v) = \log(g(v))\). Using Eq.1.5.4 we get

\[\pfrac{}{\v\theta} \v y\T f\cdot(\hv X \v\theta) = \pfrac{f\cdot(\hv X \v\theta)}{\v\theta} \v y\]

and then using Eq.1.5.5 we get

\[\begin{split}\pfrac{}{\v\theta} \v y\T f\cdot(\hv X \v\theta) &= \pfrac{\hv X\v\theta}{\v\theta} \text{diag}(f'\cdot(\hv X\v\theta)) \v y\\ &= \hv X\T \text{diag}(f'\cdot(\hv X\v\theta)) \v y\end{split}\]

Observe that \(f'(v) = g'(v)/g(v)\) and, because \(g'(v) = g(v)(1-g(v))\), we have that \(f'(v) = 1 - g(v)\), and thus

\[\pfrac{}{\v\theta} \v y\T f\cdot(\hv X \v\theta) = \hv X\T \text{diag}((1-g)\cdot(\hv X\v\theta)) \v y\]

For the second term in the gradient we get:

\[\pfrac{}{\v\theta}(\v 1 - \v y)\T \log\cdot(\v 1 - g\cdot(\hv X \v\theta)) = - \hv X\T \text{diag}(g\cdot(\hv X\v\theta))(\v 1-\v y)\]

The sum of both terms is:

\[\hv X\T\left( \text{diag}((1-g)\cdot(\hv X\v\theta))\v y - \text{diag}(g\cdot(\hv X\v\theta))(\v 1 - \v y) \right)\]

Working out the elementwise products, \(\text{diag}((1-g)\cdot(\hv X\v\theta))\v y - \text{diag}(g\cdot(\hv X\v\theta))(\v 1 - \v y) = \v y - g\cdot(\hv X\v\theta)\). Recalling the factor \(-\frac{1}{m}\) in the cost function, the gradient therefore is:

\[\pfrac{J(\v\theta)}{\v\theta} = \frac{1}{m} \hv X\T \left( g\cdot(\hv X\v\theta) - \v y \right)\]
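
As a final check, a minimal NumPy sketch of our own comparing this gradient with central differences on toy data, assuming \(g\) is the logistic (sigmoid) function, consistent with \(g'(v)=g(v)(1-g(v))\), and \(\hv X\) is again the design matrix with a leading column of ones:

    import numpy as np

    rng = np.random.default_rng(4)
    m, n = 50, 3
    X = np.hstack([np.ones((m, 1)), rng.normal(size=(m, n))])   # design matrix with a column of ones
    y = (rng.random(m) < 0.5).astype(float)                     # binary targets
    theta = rng.normal(size=n + 1)

    g = lambda v: 1.0 / (1.0 + np.exp(-v))                      # logistic (sigmoid) function
    J = lambda t: (-y @ np.log(g(X @ t)) - (1 - y) @ np.log(1 - g(X @ t))) / m
    grad = X.T @ (g(X @ theta) - y) / m                         # the gradient derived above

    eps = 1e-6
    E = eps * np.eye(n + 1)
    num = np.array([(J(theta + e) - J(theta - e)) / (2 * eps) for e in E])
    print(np.allclose(grad, num))                               # True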