The Derivative of \(f:\mathbb{R}^n \rightarrow \mathbb{R}^m\)

Jacobian, Chain Rule, Backpropagation

Jacobian

Consider \(f(x) \in \mathbb{R}^m\), where \(x \in \mathbb{R}^n\). In differential notation, \[ df = f'(x)dx \] where \(df \in \mathbb{R}^m\), \(dx \in \mathbb{R}^n\), and the linear operator \(f'(x)\) is the \(m \times n\) Jacobian matrix with entries \[ J_{ij} = \frac{\partial f_i}{\partial x_j}. \] Here, \(J_{ij}\) represents the rate of change of the \(i\)-th output \(f_i\) with respect to the \(j\)-th input \(x_j\).

For example, let \(f(x) = \begin{bmatrix}f_1(x) \\ f_2(x) \end{bmatrix}\) and \(x = \begin{bmatrix}x_1 \\ x_2 \end{bmatrix}\). Then \[ df = \begin{bmatrix} \frac{\partial f_1}{\partial x_1} & \frac{\partial f_1}{\partial x_2} \\ \frac{\partial f_2}{\partial x_1} & \frac{\partial f_2}{\partial x_2} \end{bmatrix} \begin{bmatrix} dx_1 \\ dx_2 \end{bmatrix} = \begin{bmatrix} \frac{\partial f_1}{\partial x_1}dx_1 + \frac{\partial f_1}{\partial x_2}dx_2 \\ \frac{\partial f_2}{\partial x_1}dx_1 + \frac{\partial f_2}{\partial x_2}dx_2 \end{bmatrix}. \] As you can see, the Jacobian matrix \(f'(x)\) acts as a linear operator that maps changes in \(x\), encoded in \(dx\), to the corresponding changes in \(f(x)\), encoded in \(df\).
Note: Input & Output: vector \(\Longrightarrow\) First derivative: Jacobian matrix.
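
To make the definition concrete, here is a minimal numerical sketch (assuming NumPy; the map \(f\), the evaluation point, and the step size are illustrative choices, not taken from the text) that approximates a \(2 \times 2\) Jacobian by finite differences and compares it with the analytic entries \(\partial f_i / \partial x_j\):

```python
import numpy as np

# Illustrative map f : R^2 -> R^2 (chosen for this sketch):
# f_1(x) = x_1^2 * x_2,  f_2(x) = sin(x_1) + x_2
def f(x):
    return np.array([x[0]**2 * x[1], np.sin(x[0]) + x[1]])

def numerical_jacobian(f, x, eps=1e-6):
    """Approximate J[i, j] = d f_i / d x_j with forward differences."""
    fx = f(x)
    J = np.zeros((fx.size, x.size))
    for j in range(x.size):
        dx = np.zeros_like(x)
        dx[j] = eps
        J[:, j] = (f(x + dx) - fx) / eps   # column j: response to a change in x_j
    return J

x = np.array([1.0, 2.0])

# Analytic Jacobian for comparison:
# [[2*x_1*x_2, x_1^2],
#  [cos(x_1),      1]]
J_exact = np.array([[2 * x[0] * x[1], x[0]**2],
                    [np.cos(x[0]),    1.0]])

print(np.allclose(numerical_jacobian(f, x), J_exact, atol=1e-4))  # True
```

Each column \(j\) of the approximation is simply \((f(x + \epsilon e_j) - f(x))/\epsilon\), i.e. the change in \(f\) produced by a small change in the single input \(x_j\).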

Instead of expressing the Jacobian component by component, it is often more convenient to use a symbolic representation.
For example, consider a trivial case: the matrix equation \(f(x) = Ax\) where \(A \in \mathbb{R}^{m \times n}\) is a constant matrix. Then \[ \begin{align*} df = f(x + dx) - f(x) &= A(x + dx) - Ax \\\\ &= Adx \\\\ &= f'(x)dx \end{align*} \] Thus \(f'(x) = A \) is the Jacobian matrix.
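
Because \(f(x) = Ax\) is linear, the relation \(f(x + dx) - f(x) = A\,dx\) holds exactly for any \(dx\), not just to first order. A minimal check (assuming NumPy; sizes and random data are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 3, 4                       # illustrative sizes
A = rng.normal(size=(m, n))       # constant matrix
x = rng.normal(size=n)
dx = rng.normal(size=n)

f = lambda v: A @ v
# For a linear map, df = f(x + dx) - f(x) equals A @ dx exactly,
# so the Jacobian f'(x) is A itself, independent of x.
print(np.allclose(f(x + dx) - f(x), A @ dx))  # True
```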

Chain Rule

Theorem 1 (Chain Rule): Suppose \(f(x) = g(h(x))\) where both \(g\) and \(h\) are differentiable. Then \[ df = f'(x)[dx] = g'(h(x))[h'(x)[dx]]. \]
Intuitively, if we think about a higher-dimensional case, it becomes clear that the factors in the chain rule cannot be reordered: the product of Jacobian matrices is not commutative in general.
For example, consider \[ x \in \mathbb{R}^n, \, h(x) \in \mathbb{R}^p, \, \text{and } g(h(x)) \in \mathbb{R}^m. \] Then the derivative \(f'(x)\) must be an \(m \times n\) Jacobian matrix.
To construct this \(m \times n\) matrix, we need the product of the \(m \times p\) Jacobian matrix \(g'(h(x))\) and the \(p \times n\) Jacobian matrix \(h'(x)\), in that order.
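
A quick shape check, with made-up dimensions and random matrices standing in for the two Jacobians, shows why only one ordering is possible:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, m = 4, 3, 2                   # illustrative dimensions

J_g = rng.normal(size=(m, p))       # stand-in for g'(h(x)): m x p
J_h = rng.normal(size=(p, n))       # stand-in for h'(x):    p x n

J_f = J_g @ J_h                     # f'(x) = g'(h(x)) h'(x): m x n
print(J_f.shape)                    # (2, 4)

# Swapping the factors, J_h @ J_g, would be a (p x n)(m x p) product,
# which is not even defined unless n == m -- the order matters.
```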

Backpropagation

Backpropagation, or more generally reverse-mode automatic differentiation, is a widely used algorithm in machine learning for training neural networks. It computes the gradients of the loss function with respect to the network's parameters (weights and biases), which are then used to minimize the loss and improve model performance.
Backpropagation works by passing gradients backward layer by layer, in the order dictated by the chain rule. At each layer (or function), the gradient of the loss is computed with respect to the output of the previous layer.

Consider a small neural network with three layers (functions): \(f_1, f_2, f_3\). This network represents a composition of functions: \[ L(x) = f_3(f_2(f_1(x))) \] where \(x \in \mathbb{R}^n\) represents the inputs (parameters) and \(L(x) \in \mathbb{R}^m\) is the output, often called the loss.

Suppose \(f_1 \in \mathbb{R}^p\), \(f_2 \in \mathbb{R}^q\), and \(f_3 \in \mathbb{R}^m\).
To compute the derivative of this composition, we apply the chain rule: \[ L'(x) = f_3'\,f_2'\,f_1' = (m \times q)(q \times p)(p \times n) \] where the parenthesized sizes indicate the shape of each Jacobian. The derivative is therefore a product of the Jacobian matrices of the individual layers, and \(dL = L'(x)\,dx\).

In reverse mode, the algorithm computes this product from the left, starting with \(f_3'\) and working backward. So, in the context of the neural network's structure, backpropagation appears to follow a "reverse" order, starting from the output layer and moving backward through the network. However, from a mathematical perspective, this corresponds to a standard left-to-right multiplication of Jacobian matrices.

This reverse order makes backpropagation computationally efficient for large-scale neural networks with many inputs (\(n\)) and a single scalar output (\(m = 1\)). \[ \begin{align*} (f_3'f_2')f_1' &= [(1 \times q)(q \times p)](p \times n) \\\\ &= (1 \times p)(p \times n) \\\\ &= (1 \times n) \\\\ &= (\nabla L)^T \end{align*} \] The efficiency comes from the fact that backpropagation requires only vector-Jacobian products, which are far cheaper than full matrix-matrix multiplications: at each step, only the product of a row vector (the transpose of a local gradient) with a Jacobian matrix is computed.
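
The sketch below mirrors this computation with random matrices standing in for the layer Jacobians (the widths are illustrative). The running quantity stays a row vector, so every step is a vector-Jacobian product:

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, q = 1000, 500, 200            # illustrative layer widths; m = 1

J1 = rng.normal(size=(p, n))        # stand-in for f_1': p x n
J2 = rng.normal(size=(q, p))        # stand-in for f_2': q x p
J3 = rng.normal(size=(1, q))        # stand-in for f_3': 1 x q (scalar loss)

# Reverse mode: accumulate from the left, keeping a row vector throughout.
v = J3                              # 1 x q
v = v @ J2                          # 1 x p  (vector-Jacobian product)
v = v @ J1                          # 1 x n  = (grad L)^T
grad_L = v.ravel()
print(grad_L.shape)                 # (1000,)
```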

In contrast, forward-mode differentiation multiplies the Jacobians from the right, which requires full matrix-matrix multiplications. \[ \begin{align*} f_3'(f_2'f_1') &= (1 \times q)[(q \times p)(p \times n)] \\\\ &= (1 \times q)(q \times n) \\\\ &= (1 \times n) \\\\ &= (\nabla L)^T \end{align*} \]
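
Grouping the same stand-in Jacobians the forward way shows where the extra cost comes from: the inner product \(f_2'f_1'\) is a full \(q \times n\) matrix, even though the final gradient is identical. (This only illustrates the grouping; a real forward-mode implementation would propagate one directional derivative per input rather than form the Jacobians explicitly.)

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, q = 1000, 500, 200            # same illustrative widths as above
J1 = rng.normal(size=(p, n))        # f_1' stand-in
J2 = rng.normal(size=(q, p))        # f_2' stand-in
J3 = rng.normal(size=(1, q))        # f_3' stand-in

reverse = (J3 @ J2) @ J1            # row-vector products only
forward = J3 @ (J2 @ J1)            # forms a full q x n intermediate matrix

print(np.allclose(reverse, forward))  # True: same gradient, different cost
```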