In multilayer perceptrons (MLPs), backpropagation is an efficient application of the chain rule to compute gradients layer by layer.
More generally, this technique is known as automatic differentiation (AD). AD is not limited to sequential layers. It applies to arbitrary
computational graphs, which are directed acyclic graphs (DAGs) that represent how variables are computed from inputs to outputs.
Automatic differentiation systematically applies the chain rule over this graph structure to compute exact derivatives.
In reverse-mode AD, which underlies backpropagation, we begin at the output node and work backwards through the graph,
accumulating the derivative of the output with respect to each intermediate variable.
Analytic Example of Reverse-Mode AD
To make the process of automatic differentiation concrete, let's walk through an analytic example using a
composite scalar-valued function of two variables. We'll decompose the function into primitive operations,
represent it as a computational graph, and compute its gradients using reverse-mode automatic differentiation
(i.e., backpropagation).
Consider the function
\[
f(x_1, x_2) = \log \left((x_1 + x_2)^2 + \sin(x_1 x_2) \right).
\]
We decompose this into primitive operations:
\[
\begin{align*}
x_3 &= x_1 + x_2 \\
x_4 &= x_3^2 \\
x_5 &= x_1 x_2 \\
x_6 &= \sin(x_5) \\
x_7 &= x_4 + x_6 \\
x_8 &= \log(x_7) = f
\end{align*}
\]
Written out this way, the decomposition defines a computational graph whose DAG structure is easy to see. Notice how:
- Each input variable (x₁ and x₂) has multiple outgoing edges, contributing to different intermediate computations
- The graph flows from the inputs at the top to the output at the bottom
- During backpropagation, gradients flow in the reverse direction (from f back to x₁ and x₂), as worked out below
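To complete the example, let's carry out the reverse pass on this decomposition. In the equations below, a bar denotes an adjoint, that is, the derivative of f with respect to that intermediate variable (standard reverse-mode bookkeeping notation, introduced here for clarity). The pass is seeded with the adjoint of the output set to 1 and applies the chain rule backwards through each primitive operation:
\[
\begin{align*}
\bar{x}_8 &= 1 \\
\bar{x}_7 &= \bar{x}_8 \cdot \frac{1}{x_7} = \frac{1}{x_7} \\
\bar{x}_4 &= \bar{x}_7 \cdot 1 = \frac{1}{x_7} \\
\bar{x}_6 &= \bar{x}_7 \cdot 1 = \frac{1}{x_7} \\
\bar{x}_3 &= \bar{x}_4 \cdot 2x_3 = \frac{2(x_1 + x_2)}{x_7} \\
\bar{x}_5 &= \bar{x}_6 \cdot \cos(x_5) = \frac{\cos(x_1 x_2)}{x_7} \\
\bar{x}_1 &= \bar{x}_3 \cdot 1 + \bar{x}_5 \cdot x_2 = \frac{2(x_1 + x_2) + x_2 \cos(x_1 x_2)}{(x_1 + x_2)^2 + \sin(x_1 x_2)} \\
\bar{x}_2 &= \bar{x}_3 \cdot 1 + \bar{x}_5 \cdot x_1 = \frac{2(x_1 + x_2) + x_1 \cos(x_1 x_2)}{(x_1 + x_2)^2 + \sin(x_1 x_2)}
\end{align*}
\]
The last two lines show the path-summation rule in action: x₁ and x₂ each feed both x₃ and x₅, so their adjoints accumulate one contribution from each path.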
The power of automatic differentiation lies in its systematic approach:
- Decompose complex functions into simple primitive operations
- Apply the chain rule mechanically through the computational graph
- Sum gradients when variables contribute through multiple paths
This process can be fully automated, making it the backbone of modern deep learning frameworks.
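As a sanity check on the derivation, here is a minimal hand-written sketch of the same forward and reverse passes in plain Python (the function name f_and_grads and the sample inputs are illustrative choices, not part of the example above):

```python
import math

def f_and_grads(x1, x2):
    """Hand-written reverse-mode AD for f(x1, x2) = log((x1 + x2)**2 + sin(x1 * x2))."""
    # Forward pass: evaluate each primitive operation and keep the intermediates.
    x3 = x1 + x2
    x4 = x3 ** 2
    x5 = x1 * x2
    x6 = math.sin(x5)
    x7 = x4 + x6
    x8 = math.log(x7)                # x8 = f

    # Reverse pass: propagate adjoints from the output back to the inputs.
    x8_bar = 1.0                     # df/dx8
    x7_bar = x8_bar / x7             # d log(x7)/dx7 = 1/x7
    x4_bar = x7_bar                  # x7 = x4 + x6
    x6_bar = x7_bar
    x3_bar = x4_bar * 2 * x3         # x4 = x3**2
    x5_bar = x6_bar * math.cos(x5)   # x6 = sin(x5)

    # x1 and x2 each feed two paths (via x3 and x5), so their adjoints are summed.
    x1_bar = x3_bar + x5_bar * x2
    x2_bar = x3_bar + x5_bar * x1
    return x8, x1_bar, x2_bar

print(f_and_grads(1.0, 2.0))         # value of f and its two partial derivatives
```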
Applications of AD
Automatic differentiation is a core component in modern computational systems that require efficient and
accurate derivatives. In particular, it powers nearly all deep learning frameworks such as:
- PyTorch — dynamic computational graphs with reverse-mode AD via autograd
- TensorFlow — eager and static (graph) execution, with reverse-mode AD via tf.GradientTape
- JAX — composable transformations such as grad, vmap, and jit, based on function tracing and XLA compilation
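For instance, a few lines of PyTorch reproduce the gradients derived analytically above (the sample inputs x1 = 1.0 and x2 = 2.0 are arbitrary illustrative values):

```python
import torch

# Inputs as leaf tensors that track gradients.
x1 = torch.tensor(1.0, requires_grad=True)
x2 = torch.tensor(2.0, requires_grad=True)

# The forward pass builds the computational graph on the fly.
f = torch.log((x1 + x2) ** 2 + torch.sin(x1 * x2))

# Reverse-mode AD: accumulates df/dx1 and df/dx2 into the .grad fields.
f.backward()
print(x1.grad, x2.grad)
```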