\(\newcommand{\in}{\text{in}}\) \(\newcommand{\out}{\text{out}}\) \(\newcommand{\prt}{\partial}\)
6.4.3.1. Linear Block
The computational graph of a fully connected linear module is depicted in the figure below.
The input is an \(s_\in\)-dimensional vector \(\v x\) and the output is an \(s_\out\)-dimensional vector \(\v u\). We have:

\[\v u = W \v x\]

where \(W\) is an \(s_\out\times s_\in\) matrix. Assuming \(\pfrac{\ell}{\v u}\) is known we can calculate \(\pfrac{\ell}{\v x}\):

\[\pfrac{\ell}{\v x} = W^T \pfrac{\ell}{\v u}\]
The proof of this result is rather simple. We could either dive into matrix calculus (see Matrix Calculus) or give a straightforward proof by looking at the components of the vectors, keeping in mind the chain rule for multivariate functions (see Multivariate Functions). We will follow the second route. In components we have \(u_i = \sum_j W_{ij} x_j\), so the chain rule gives

\[\pfrac{\ell}{x_j} = \sum_i \pfrac{\ell}{u_i}\,\pfrac{u_i}{x_j} = \sum_i \pfrac{\ell}{u_i}\, W_{ij}\]

and thus

\[\pfrac{\ell}{\v x} = W^T \pfrac{\ell}{\v u}.\]
Next we need to know the derivative \(\prt\ell/\prt W\) in order to update the weights in a gradient descent procedure. Again we start with an elementwise analysis:

\[\pfrac{\ell}{W_{ij}} = \sum_k \pfrac{\ell}{u_k}\,\pfrac{u_k}{W_{ij}}\]

where

\[\pfrac{u_k}{W_{ij}} = \pfrac{}{W_{ij}} \sum_m W_{km} x_m = \delta_{ki}\, x_j\]

with \(\delta_{ki}\) the Kronecker delta. Substituting this into the expression for \(\prt \ell/\prt W_{ij}\) we get:

\[\pfrac{\ell}{W_{ij}} = \pfrac{\ell}{u_i}\, x_j\]

or equivalently:

\[\pfrac{\ell}{W} = \pfrac{\ell}{\v u}\, \v x^T.\]
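As a quick numerical sanity check, the following minimal NumPy sketch compares both formulas against finite differences. It is not part of the derivation: the toy loss \(\ell = \v a^T W \v x\), the dimensions, and the variable names (`dl_dx`, `dl_dW`, etc.) are illustrative choices only; any scalar loss defined on \(\v u\) would serve equally well.

```python
import numpy as np

rng = np.random.default_rng(0)
s_in, s_out = 4, 3

W = rng.standard_normal((s_out, s_in))   # s_out x s_in weight matrix
x = rng.standard_normal(s_in)            # input vector
a = rng.standard_normal(s_out)           # fixed vector defining the toy loss

def loss(W, x):
    # toy scalar loss: ell = a . (W x), so dl/du = a with u = W x
    return a @ (W @ x)

dl_du = a

# formulas from the text
dl_dx = W.T @ dl_du                      # dl/dx = W^T dl/du
dl_dW = np.outer(dl_du, x)               # dl/dW = (dl/du) x^T

# finite-difference checks
eps = 1e-6
dl_dx_num = np.zeros(s_in)
for j in range(s_in):
    e = np.zeros(s_in); e[j] = eps
    dl_dx_num[j] = (loss(W, x + e) - loss(W, x - e)) / (2 * eps)

dl_dW_num = np.zeros_like(W)
for i in range(s_out):
    for j in range(s_in):
        E = np.zeros_like(W); E[i, j] = eps
        dl_dW_num[i, j] = (loss(W + E, x) - loss(W - E, x)) / (2 * eps)

print(np.allclose(dl_dx, dl_dx_num))     # True
print(np.allclose(dl_dW, dl_dW_num))     # True
```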
Let \(X\) be the data matrix in which each row is an input vector and let \(U\) be the matrix in which each row is the corresponding output vector. Then

\[U = X W^T\]

or

\[U^T = W X^T,\]
where each row in \(U\) is the linear response to the corresponding row in \(X\). In this case:

\[\pfrac{\ell}{X} = \pfrac{\ell}{U}\, W\]

or

\[\left(\pfrac{\ell}{X}\right)^T = W^T \left(\pfrac{\ell}{U}\right)^T.\]

For the derivative with respect to the weight matrix we have:

\[\pfrac{\ell}{W} = \left(\pfrac{\ell}{U}\right)^T X.\]
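The batch formulas can be checked in the same way. The sketch below is again only an illustration under assumed dimensions; the matrix \(A\) merely defines an arbitrary scalar loss \(\ell = \sum_{ni} A_{ni} U_{ni}\), so that \(\pfrac{\ell}{U} = A\).

```python
import numpy as np

rng = np.random.default_rng(1)
N, s_in, s_out = 5, 4, 3

X = rng.standard_normal((N, s_in))       # each row is an input vector
W = rng.standard_normal((s_out, s_in))
A = rng.standard_normal((N, s_out))      # fixed matrix defining the toy loss

def loss(W, X):
    # toy scalar loss: ell = sum(A * U) with U = X W^T, so dl/dU = A
    return np.sum(A * (X @ W.T))

dl_dU = A

# batch formulas from the text
dl_dX = dl_dU @ W                        # dl/dX = (dl/dU) W
dl_dW = dl_dU.T @ X                      # dl/dW = (dl/dU)^T X

# finite-difference spot checks on single entries
eps = 1e-6
E = np.zeros_like(X); E[2, 1] = eps
print(np.isclose(dl_dX[2, 1], (loss(W, X + E) - loss(W, X - E)) / (2 * eps)))  # True

F = np.zeros_like(W); F[0, 3] = eps
print(np.isclose(dl_dW[0, 3], (loss(W + F, X) - loss(W - F, X)) / (2 * eps)))  # True
```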