The Chain Rule of Differentiation
=================================

A chain of functions
--------------------

.. sidebar:: Example

   .. math::
      f(x) &= x^2 \\
      g(x) &= \exp(-x) \\
      h(x) &= C x

   .. math::
      k(x) &= h(g(f(x))) \\
           &= C \exp(-x^2)

   .. math::
      k'(x) = -2 C x \exp(-x^2)

Consider the functions $f$, $g$ and $h$ and their chain composition
$h\after g\after f$ defined as:

.. math::
   y = k(x) = (h\after g\after f)(x) = h(g(f(x)))

The function composition is sketched as a flow graph in :numref:`fig-hogof`.

.. _fig-hogof:

.. figure:: /figures/hogof.png
   :figwidth: 100%
   :width: 50%
   :align: center

   **Flow graph of the function composition** $h\after g\after f$

The Chain Rule
--------------

Let's start with the composition of two functions:

.. math::
   h(x) = (g\after f)(x) = g(f(x))

When we differentiate a function we are looking at the change in the function
value as a consequence of an infinitesimal change in the independent variable
$x$. To first order we have:

.. math::
   h(x+dx) = h(x) + h'(x) dx

Now we do the same for $h(x)=g(f(x))$:

.. math::
   h(x+dx) &= g(f(x+dx))\\
           &= g(f(x)+f'(x)dx)\\
           &= g(f(x)) + g'(f(x))f'(x)dx\\
           &= h(x) + g'(f(x)) f'(x) dx

Comparing both expressions for $h(x+dx)$ we get:

.. math::
   h'(x) = g'(f(x)) f'(x)

the familiar chain rule that all of us learned in high school.

.. sidebar:: Example

   .. math::
      f(x) &= x^2 \\
      g(x) &= \exp(-x) \\
      h(x) &= C x

   .. math::
      k(x) &= h(g(f(x))) \\
           &= C \exp(-x^2)

   .. math::
      f'(x) &= 2x\\
      g'(x) &= -\exp(-x)\\
      h'(x) &= C

   .. math::
      k'(x) &= h'(g(f(x)))\;g'(f(x))\; f'(x)\\
            &= C \; (-\exp(-x^2)) \; (2 x)\\
            &= -2Cx\exp(-x^2)

The derivative of the composed function $k = h\after g\after f$ is given by
the chain rule

.. math::
   :label: eq-chainrule

   \frac{dk}{dx} = (h\after g\after f)'(x) = h'(g(f(x)))\;g'(f(x))\; f'(x)

This is easily proved given the chain rule for the composition of two
functions. In functional terms we can write:

.. math::
   (h\after g\after f)' = (h'\after g \after f)\;(g'\after f)\;f'

where the multiplication of functions like $f g$ is defined as
$(fg)(x)=f(x)g(x)$.

The Leibnitz Notation
---------------------

Consider again the function composition

.. math::
   (h\after g\after f)(x) = h(g(f(x)))

We define $u = f(x)$ and $v = (g\after f)(x)$ and indicate these values in the
flow graph (see :numref:`fig-hogof-uv`).

.. _fig-hogof-uv:

.. figure:: /figures/hogof_uv.png
   :figwidth: 100%
   :width: 50%
   :align: center

   **Flow graph of the function composition** $h\after g\after f$ with
   intermediate values.

The derivative of the composed function is:

.. math::
   (h\after g\after f)'(x) = h'(v) g'(u) f'(x)

or equivalently:

.. math::
   \frac{dy}{dx} = \frac{dy}{dv} \frac{dv}{du} \frac{du}{dx}

From here on the traditional Leibnitz notation gets a bit sloppy: confusing at
first, but convenient once you get used to it. Instead of using a new symbol
for the result of a function in the chain (in our example $u=f(x)$ and
$v=g(f(x))$) we use the function name for its output value as well. The chain
rule in Leibnitz form then becomes:

.. math::
   \frac{dh}{dx} = \frac{dh}{dg}\;\frac{dg}{df}\;\frac{df}{dx}

In summary: for every node (function) in the chain we differentiate the output
with respect to its input. By multiplying these 'one-function-derivatives' we
can calculate the derivative of (part of) the chain.
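This per-node picture lends itself to a simple numerical sanity check. Below
is a minimal sketch (assuming Python with NumPy and the arbitrary choice
$C=1.5$; it is an illustration, not part of the derivation) that multiplies
the 'one-function-derivatives' for the running example $k(x)=C\exp(-x^2)$ and
compares the product with a finite difference approximation of $k'(x)$.

.. code-block:: python

   # Numerical check of the chain rule for k(x) = C exp(-x^2), assuming NumPy.
   import numpy as np

   C = 1.5                                  # arbitrary constant for the example
   f = lambda x: x**2
   g = lambda x: np.exp(-x)
   h = lambda x: C * x
   k = lambda x: h(g(f(x)))                 # the composed chain k = h o g o f

   x0 = 0.7
   u0 = f(x0)                               # intermediate value u = f(x)

   # per-node derivatives, evaluated at the values flowing through the graph
   df = 2 * x0                              # f'(x0)
   dg = -np.exp(-u0)                        # g'(f(x0))
   dh = C                                   # h'(g(f(x0)))

   chain = dh * dg * df                     # h'(g(f(x))) g'(f(x)) f'(x)
   eps = 1e-6
   numeric = (k(x0 + eps) - k(x0 - eps)) / (2 * eps)   # central difference
   print(chain, numeric)                    # the two estimates should agree closely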
Multivariate Chain Rule
-----------------------

In the machine learning context we will be using the chain rule a lot for
multivariate functions. Consider a simple one:

.. math::
   f(u, v)

We will look at how the value of $f$ changes if both $u$ and $v$ change a
small bit, say from $u$ to $u+du$ and from $v$ to $v+dv$. The change $df$ in
$f$ then is:

.. math::
   df = f(u+du, v+dv) - f(u, v)

Note that in the univariate case we have

.. math::
   f(x+dx) = f(x) + f_x(x) dx

If we now only consider the change in $u$ and keep the second argument
constant we get:

.. math::
   df = f(u, v+dv) + f_u(u, v+dv) du - f(u,v)

Now for both terms we keep $u$ constant and apply the same line of reasoning
to the second argument of $f$; we get:

.. math::
   df &= f(u,v) + f_v(u,v)dv + (f_u(u,v) + f_{uv}(u,v)dv)du - f(u,v)\\
      &= f_u(u,v)du + f_v(u,v)dv + f_{uv}(u,v) du dv

Note that $du\,dv$ is negligibly small compared with both $du$ and $dv$, and
thus to first order:

.. math::
   df = f_u(u,v) du + f_v(u,v) dv

This is called the **total derivative** of $f$. The above is an intuitive
introduction to the concept, but it can be given a solid mathematical
foundation. **The total derivative states that the change of** $f$ **is the
sum of the contributions due to the changes in its arguments.**

Now we consider the situation where both $u$ and $v$ are functions of $x$.
Then we have

.. math::
   df &= f_u(u(x), v(x)) du(x) + f_v(u(x),v(x))dv(x)\\
      &= f_u(u(x), v(x)) u_x(x) dx + f_v(u(x),v(x)) v_x(x) dx

and thus

.. math::
   \frac{df}{dx} = f_u(u(x), v(x)) u_x(x) + f_v(u(x),v(x)) v_x(x)

or in Leibnitz notation:

.. math::
   \frac{df}{dx} = \pfrac{f}{u} \frac{du}{dx} + \pfrac{f}{v} \frac{dv}{dx}

In general, for a function $f$ with $n$ arguments each depending on $x$:

.. math::
   g(x) = f(u_1(x),\cdots,u_n(x))

we have:

.. math::
   g'(x) = \sum_{i=1}^n f_{u_i}(u_1(x),\cdots,u_n(x)) u_i'(x)

or in Leibnitz notation

.. math::
   \frac{dg}{dx} = \sum_{i=1}^n \pfrac{f}{u_i} \frac{du_i}{dx}

And finally, if we are given a multivariate function $f$ whose $n$ arguments
all depend on $m$ variables $x_1,\ldots,x_m$, i.e.

.. math::
   g(x_1,\ldots,x_m) = f(u_1(x_1,\ldots,x_m),\ldots,u_n(x_1,\ldots,x_m)),

we get

.. math::
   \pfrac{g}{x_j} = \sum_{i=1}^n \pfrac{f}{u_i} \pfrac{u_i}{x_j}

This last expression is of great importance in the machine learning context.
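As a concrete illustration of this expression, the sketch below (assuming
Python with NumPy; the particular functions $f$, $u_1$ and $u_2$ are made up
for the example) evaluates $\pfrac{g}{x_1}$ as the sum
$\sum_i \pfrac{f}{u_i}\,\pfrac{u_i}{x_1}$ and compares it with a finite
difference approximation.

.. code-block:: python

   # Numerical check of dg/dx1 = sum_i df/du_i * du_i/dx1, assuming NumPy.
   import numpy as np

   u1 = lambda x1, x2: x1 + x2**2           # first intermediate function
   u2 = lambda x1, x2: np.sin(x1 * x2)      # second intermediate function
   f  = lambda a, b: a * b                  # f(u1, u2) = u1 * u2
   g  = lambda x1, x2: f(u1(x1, x2), u2(x1, x2))

   x1, x2 = 0.8, 1.3

   # partial derivatives of f and of the intermediate functions
   df_du1  = u2(x1, x2)                     # d(u1*u2)/du1 = u2
   df_du2  = u1(x1, x2)                     # d(u1*u2)/du2 = u1
   du1_dx1 = 1.0                            # d(x1 + x2^2)/dx1
   du2_dx1 = x2 * np.cos(x1 * x2)           # d(sin(x1*x2))/dx1

   chain = df_du1 * du1_dx1 + df_du2 * du2_dx1   # the sum over i
   eps = 1e-6
   numeric = (g(x1 + eps, x2) - g(x1 - eps, x2)) / (2 * eps)   # central difference
   print(chain, numeric)                    # both estimate dg/dx1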
.. A longer, more roundabout derivation of the same result:

   In the machine learning context we will be using the chain rule a lot for
   multivariate functions. Consider a simple one

   .. math::
      g(x) = f(u(x), v(x))

   thus $f$ is a multivariate function with two arguments. The first argument
   is a function $u$ in $x$ and the second argument is a function $v$ in $x$.
   Differentiating $g$ with respect to $x$ we are looking at the way the value
   of $g$ changes in case $x$ changes a bit from $x$ to $x+dx$.

   .. math::
      g(x+dx) = f(u(x+dx), v(x+dx))

   To simplify the notation a bit we write

   .. math::
      g(x+dx) = f(u+du, v+dv)

   where we omitted the argument $x$ for the $u$ and $v$ functions (but
   remember they are still functions of $x$). Look at $u+du$:

   .. math::
      u+du = u(x+dx) = u(x)+u'(x)dx

   from which we conclude that $du = u'dx$ (again omitting the argument $x$
   from $u'$). In the same way we see that $dv = v'dx$.

   We continue with the term $f(u+du, v+dv)$. Concentrating on the first
   argument $u+du$ we can write to first order

   .. math::
      f(u+du, v+dv) = f(u, v+dv) + f_u(u, v+dv) du,

   where we have used the (commonly used) notation

   .. math::
      f_u = \pfrac{f}{u}.

   For the term $f(u,v+dv)$ we can do the same but then for the second
   argument of $f$:

   .. math::
      f(u, v+dv) = f(u,v) + f_v(u,v)dv

   For the term $f_u(u,v+dv)$ we have

   .. math::
      f_u(u, v+dv) = f_u(u,v) + f_{uv}(u,v)dv

   Combining all the terms we have

   .. math::
      f(u+du, v+dv) = f(u,v) + f_u(u,v)du + f_v(u,v)dv + f_{uv}(u,v)dudv

   but as we are still interested in the first order behavior we can omit the
   last term (if $du$ and $dv$ are infinitesimally small then $du\,dv$ is
   negligible in comparison):

   .. math::
      f(u+du, v+dv) = f(u,v) + f_u(u,v)du + f_v(u,v)dv

   Now we reintroduce the $x$ arguments

   .. math::
      :label: eq-1stf

      f(u(x+dx), v(x+dx)) &= f(u(x), v(x)) + \\
      &f_u(u(x),v(x))u'(x)dx + f_v(u(x),v(x))v'(x)dx

   Note that $f(u(x+dx), v(x+dx))=g(x+dx)$ by definition and so

   .. math::
      :label: eq-1stg

      g(x+dx) = g(x)+g'(x)dx

   Comparing :eq:`eq-1stf` and :eq:`eq-1stg` we get

   .. math::
      g(x)+g'(x)dx = f(u(x), v(x)) + f_u(u(x),v(x))u'(x)dx +
      f_v(u(x),v(x))v'(x)dx

   Remember that $g(x)=f(u(x), v(x))$, so these values on the left and right
   hand side cancel; then also the $dx$ factors cancel and we are left with:

   .. math::
      g'(x) = f_u(u(x),v(x))u'(x) + f_v(u(x),v(x))v'(x)

   or in Leibnitz notation

   .. math::
      \frac{dg}{dx} = \pfrac{f}{u}\frac{du}{dx} + \pfrac{f}{v}\frac{dv}{dx}

   This long derivation was done just to show that when **using the chain rule
   on a multivariate function the contributions of all the arguments add up in
   the calculation of the derivative.**

   In general when

   .. math::
      g(x) = f(u_1(x),\cdots,u_n(x))

   then

   .. math::
      g'(x) = \sum_{i=1}^n f_{u_i}(u_1(x),\cdots,u_n(x)) u_i'(x)

   Or in Leibnitz notation

   .. math::
      \frac{dg}{dx} = \sum_{i=1}^n \pfrac{f}{u_i} \frac{du_i}{dx}

   And finally if we are given a multivariate function $g$ to start with:

   .. math::
      g(x_1,\ldots,x_m) = f(u_1(x_1,\ldots,x_m),\ldots,u_n(x_1,\ldots,x_m))

   then

   .. math::
      \pfrac{g}{x_j} = \sum_{i=1}^n \pfrac{f}{u_i} \pfrac{u_i}{x_j}

   This last expression is of great importance in the machine learning
   context.

Exercises
---------

#. Chain rule for the composition of more than two functions

   Given the chain rule for the composition of two functions:

   .. math::
      (g\after f)'(x) = g'(f(x))\;f'(x)

   or equivalently in functional notation

   .. math::
      (g\after f)' = (g'\after f)\; f'

   a. Give a proof for the derivative of the composition of three functions,
      i.e. give a proof for :eq:`eq-chainrule`.

   b. Now consider the function composition
      $y = (f_n\after f_{n-1} \after\cdots\after f_1)(x)$. Give the derivative
      of $y$ with respect to $x$ for this function chain using the Leibnitz
      notation.

#. Leibnitz notation for the chain rule

   Consider the function

   .. math::
      y = \log(\cos(x^2))

   which can be written as the composition

   .. math::
      y = (f\after g\after h)(x)

   a. Identify the three functions $f$, $g$ and $h$.

   b. Set $y = f(u)$, $u = g(v)$ and $v = h(x)$ and calculate $dy/dx$ using
      the Leibnitz convention.
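When working through these exercises it can be convenient to check a
hand-derived derivative symbolically. The snippet below is a minimal sketch,
assuming the SymPy package is available; it is shown for the running example
$k(x)=C\exp(-x^2)$ from the text rather than for the exercises themselves.

.. code-block:: python

   # Symbolic check of a hand-derived derivative, assuming SymPy is installed.
   import sympy as sp

   x, C = sp.symbols('x C')
   k = C * sp.exp(-x**2)        # k = h(g(f(x))) with f(x)=x^2, g(x)=exp(-x), h(x)=C x
   print(sp.diff(k, x))         # prints -2*C*x*exp(-x**2), matching the text

The same call to ``sp.diff`` can be used to verify an answer to the second
exercise once the three functions have been identified.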