1.5. Matrix Calculus
Optimization is essential in machine learning. Almost all machine learning algorithms start from the optimization of a scalar loss (or cost) function with respect to an input vector \(\v x\) or a parameter vector \(\v p\).
Things become more complicated when we differentiate all elements of a vector with respect to all elements of another vector, or when matrices are involved as well.
Matrix calculus provides the tools to elegantly deal with these derivatives.
This section is based on the Wikipedia article on matrix calculus. We choose the so-called denominator layout, as explained in that article.
1.5.1. The derivative of a scalar function with respect to a vector
Let \(y\) be a scalar function of all elements \(x_i\) in vector \(\v x\). By definition we state:
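Written out in the denominator layout, this is the column vector of all partial derivatives (for \(\v x\in\setR^n\)):

\[
\frac{\partial y}{\partial \v x} =
\begin{pmatrix}
\frac{\partial y}{\partial x_1}\\
\frac{\partial y}{\partial x_2}\\
\vdots\\
\frac{\partial y}{\partial x_n}
\end{pmatrix}
\]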
The derivative of such a scalar function is often called the gradient of the function.
1.5.2. The derivative of a vector with respect to a scalar
Let \(\v y\) be a vector and let \(x\) be a scalar; then:
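In the denominator layout the derivative of an \(m\)-dimensional vector with respect to a scalar is laid out as a row vector:

\[
\frac{\partial \v y}{\partial x} =
\begin{pmatrix}
\frac{\partial y_1}{\partial x} & \frac{\partial y_2}{\partial x} & \cdots & \frac{\partial y_m}{\partial x}
\end{pmatrix}
\]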
1.5.3. The derivative of a vector with respect to a vector
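Let \(\v y\in\setR^m\) be a vector function of \(\v x\in\setR^n\). In the denominator layout the derivative is the \(n\times m\) matrix whose \((i,j)\) element is \(\partial y_j/\partial x_i\):

\[
\frac{\partial \v y}{\partial \v x} =
\begin{pmatrix}
\frac{\partial y_1}{\partial x_1} & \frac{\partial y_2}{\partial x_1} & \cdots & \frac{\partial y_m}{\partial x_1}\\
\frac{\partial y_1}{\partial x_2} & \frac{\partial y_2}{\partial x_2} & \cdots & \frac{\partial y_m}{\partial x_2}\\
\vdots & \vdots & \ddots & \vdots\\
\frac{\partial y_1}{\partial x_n} & \frac{\partial y_2}{\partial x_n} & \cdots & \frac{\partial y_m}{\partial x_n}
\end{pmatrix}
\]

i.e. the transpose of the Jacobian matrix as it is usually defined.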
1.5.4. Some important derivatives in the machine learning context
For \(A\) not a function of \(\v x\):
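the following standard identities hold in the denominator layout (with \(\v a\) a constant vector):

\[
\frac{\partial A\v x}{\partial \v x} = A^T,\qquad
\frac{\partial \v a^T\v x}{\partial \v x} = \v a,\qquad
\frac{\partial \v x^T \v x}{\partial \v x} = 2\v x,\qquad
\frac{\partial \v x^T A \v x}{\partial \v x} = (A + A^T)\v x
\]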
Let \(\v u\) be a vector function such that \(\v x\in\setR^n\mapsto\v u(\v x)\in\setR^m\), then:
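one useful case (a quadratic form in \(\v u\), with \(A\) a constant matrix) is:

\[
\frac{\partial\, \v u^T A \v u}{\partial \v x} = \frac{\partial \v u}{\partial \v x}\,(A + A^T)\,\v u
\]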
or more generally
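the chain rule in the denominator layout reads:

\[
\frac{\partial \v f(\v u(\v x))}{\partial \v x} = \frac{\partial \v u}{\partial \v x}\,\frac{\partial \v f(\v u)}{\partial \v u}
\]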
where \(\v f\) is a vector valued function. Note that the choice \(\v f(\v u)=A\v u\) and using \(\frac{\partial \v x^T A \v x}{\partial \v x}\) leads to the result in Eq.1.5.4.
Let \(f\) be a scalar function; then with \(f\cdot(\v x)\) we denote the elementwise application of the function \(f\) to the vector \(\v x\):
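Written out, for \(\v x\in\setR^n\):

\[
f\cdot(\v x) = \begin{pmatrix} f(x_1)\\ f(x_2)\\ \vdots\\ f(x_n) \end{pmatrix}
\]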
We introduce this special notation to prevent confusion with a scalar function, say \(g\), that has a vector as argument and produces a scalar: \(g(\v x)\). Again let \(\v y = \v y(\v x)\); then:
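In the denominator layout the \((i,j)\) element of this derivative equals \(f'(y_j)\,\partial y_j/\partial x_i\), which can be written compactly as (with \(\operatorname{diag}(\cdot)\) the diagonal matrix built from a vector, notation assumed here):

\[
\frac{\partial\, f\cdot(\v y)}{\partial \v x} = \frac{\partial \v y}{\partial \v x}\,\operatorname{diag}\!\big(f'\cdot(\v y)\big)
\]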
1.5.5. Examples from Machine Learning
1.5.5.1. Linear Regression
The cost function in linear regression is:
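For concreteness, collect the training samples as the rows of a data matrix \(X\), the targets in the vector \(\v y\) and the parameters in the vector \(\v\theta\) (names assumed here); up to a constant factor the cost is then:

\[
J(\v\theta) = (X\v\theta - \v y)^T (X\v\theta - \v y)
\]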
The gradient function then is:
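Writing \(\v u = X\v\theta - \v y\), so that \(\partial\v u/\partial\v\theta = X^T\) and \(J = \v u^T\v u\), the chain rule and the identities above give:

\[
\frac{\partial J}{\partial \v\theta} = \frac{\partial \v u}{\partial \v\theta}\,\frac{\partial\, \v u^T\v u}{\partial \v u} = 2\,X^T(X\v\theta - \v y)
\]

Setting this gradient to zero leads to the normal equations \(X^T X\,\v\theta = X^T\v y\).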
1.5.5.2. Logistic Regression
The cost function in logistic regression is:
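Using the same (assumed) notation for \(X\), \(\v y\) and \(\v\theta\), with \(g\) the logistic function \(g(v)=1/(1+e^{-v})\), targets \(y_i\in\{0,1\}\) and \(\v 1\) the vector of all ones, the negative log-likelihood can be written as:

\[
L(\v\theta) = -\Big( \v y^T \log\cdot\big(g\cdot(X\v\theta)\big) + (\v 1 - \v y)^T \log\cdot\big(\v 1 - g\cdot(X\v\theta)\big) \Big)
\]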
In calculating the gradient we first consider the term:
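here taken to be the first part of the cost, written in the assumed matrix notation:

\[
\frac{\partial}{\partial \v\theta}\, \v y^T (\log\circ g)\cdot(X\v\theta) = \frac{\partial}{\partial \v\theta}\, \v y^T f\cdot(X\v\theta)
\]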
where we introduced \(f = \log \circ g\) (the composition of \(\log\) after \(g\)). Using Eq.1.5.4 we get
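applying the chain rule with \(\v u = X\v\theta\), so that \(\partial\v u/\partial\v\theta = X^T\):

\[
\frac{\partial}{\partial \v\theta}\, \v y^T f\cdot(X\v\theta) = X^T\, \frac{\partial\, \v y^T f\cdot(\v u)}{\partial \v u}\bigg|_{\v u = X\v\theta}
\]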
and then using Eq.1.5.5 we get
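writing \(\odot\) for the elementwise product (notation assumed here):

\[
\frac{\partial}{\partial \v\theta}\, \v y^T f\cdot(X\v\theta) = X^T\big( f'\cdot(X\v\theta) \odot \v y \big)
\]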
Observe that because \(g'(v) = g(v)(1-g(v))\) we have that \(f'(v) = g'(v)/g(v) = 1 - g(v)\) and thus
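for the first term, in the assumed notation:

\[
\frac{\partial}{\partial \v\theta}\, \v y^T f\cdot(X\v\theta) = X^T\Big( \big(\v 1 - g\cdot(X\v\theta)\big) \odot \v y \Big)
\]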
For the second term in the gradient we get:
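proceeding in the same way for \((\v 1-\v y)^T\log\cdot\big(\v 1 - g\cdot(X\v\theta)\big)\), where the elementwise derivative is now \(\frac{d}{dv}\log(1-g(v)) = -g'(v)/(1-g(v)) = -g(v)\):

\[
\frac{\partial}{\partial \v\theta}\, (\v 1-\v y)^T \log\cdot\big(\v 1 - g\cdot(X\v\theta)\big) = -X^T\Big( g\cdot(X\v\theta) \odot (\v 1 - \v y) \Big)
\]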
For the sum of both terms:
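in the assumed notation the cross terms cancel and, including the leading minus sign of the cost, the gradient becomes:

\[
X^T\Big( \big(\v 1 - g\cdot(X\v\theta)\big)\odot\v y \;-\; g\cdot(X\v\theta)\odot(\v 1-\v y) \Big) = X^T\big( \v y - g\cdot(X\v\theta) \big),
\qquad
\frac{\partial L}{\partial \v\theta} = X^T\big( g\cdot(X\v\theta) - \v y \big)
\]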