In multilayer perceptrons (MLPs), backpropagation is an efficient application of the chain rule to compute gradients layer by layer.
More generally, this technique is known as automatic differentiation (AD). AD is not limited to sequential layers. It applies to arbitrary
computational graphs, which are directed acyclic graphs (DAGs) that represent how variables are computed from inputs to outputs.
Automatic differentiation systematically applies the chain rule over this graph structure to compute exact derivatives.
In reverse-mode AD, which underlies backpropagation, we begin at the output node and work backwards through the graph,
accumulating the derivative of the output with respect to each intermediate variable.
Analytic Example of Reverse-Mode AD
To make the process of automatic differentiation concrete, let's walk through an analytic example using a
composite scalar-valued function of two variables. We'll decompose the function into primitive operations,
represent it as a computational graph, and compute its gradients using reverse-mode automatic differentiation
(i.e., backpropagation).
Consider the function
\[
f(x_1, x_2) = \log \left((x_1 + x_2)^2 + \sin(x_1 x_2) \right).
\]
We decompose this into primitive operations:
\[
\begin{align*}
x_3 &= x_1 + x_2 \\
x_4 &= x_3^2 \\
x_5 &= x_1 x_2 \\
x_6 &= \sin(x_5) \\
x_7 &= x_4 + x_6 \\
x_8 &= \log(x_7) = f
\end{align*}
\]
Written out this way, the decomposition defines a computational graph whose DAG structure is easy to see. Notice how:
- Each input variable (x₁ and x₂) has multiple outgoing edges, contributing to different intermediate computations
- The graph flows from the inputs at the top to the output at the bottom
- During backpropagation, gradients flow in the reverse direction (from f back to x₁ and x₂), as worked out below
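To complete the example, let's carry out the reverse pass on this decomposition. In the equations below, a bar denotes an adjoint, that is, the derivative of f with respect to that intermediate variable (standard reverse-mode bookkeeping notation, introduced here for clarity). The pass is seeded with the adjoint of the output set to 1 and applies the chain rule backwards through each primitive operation:
\[
\begin{align*}
\bar{x}_8 &= 1 \\
\bar{x}_7 &= \bar{x}_8 \cdot \frac{1}{x_7} = \frac{1}{x_7} \\
\bar{x}_4 &= \bar{x}_7 \cdot 1 = \frac{1}{x_7} \\
\bar{x}_6 &= \bar{x}_7 \cdot 1 = \frac{1}{x_7} \\
\bar{x}_3 &= \bar{x}_4 \cdot 2x_3 = \frac{2(x_1 + x_2)}{x_7} \\
\bar{x}_5 &= \bar{x}_6 \cdot \cos(x_5) = \frac{\cos(x_1 x_2)}{x_7} \\
\bar{x}_1 &= \bar{x}_3 \cdot 1 + \bar{x}_5 \cdot x_2 = \frac{2(x_1 + x_2) + x_2 \cos(x_1 x_2)}{(x_1 + x_2)^2 + \sin(x_1 x_2)} \\
\bar{x}_2 &= \bar{x}_3 \cdot 1 + \bar{x}_5 \cdot x_1 = \frac{2(x_1 + x_2) + x_1 \cos(x_1 x_2)}{(x_1 + x_2)^2 + \sin(x_1 x_2)}
\end{align*}
\]
The last two lines show the path-summation rule in action: x₁ and x₂ each feed both x₃ and x₅, so their adjoints accumulate one contribution from each path.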
The power of automatic differentiation lies in its systematic approach:
- Decompose complex functions into simple primitive operations
- Apply the chain rule mechanically through the computational graph
- Sum gradients when variables contribute through multiple paths
This process can be fully automated, making it the backbone of modern deep learning frameworks.
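As a sanity check on the derivation, here is a minimal hand-written sketch of the same forward and reverse passes in plain Python (the function name f_and_grads and the sample inputs are illustrative choices, not part of the example above):

```python
import math

def f_and_grads(x1, x2):
    """Hand-written reverse-mode AD for f(x1, x2) = log((x1 + x2)**2 + sin(x1 * x2))."""
    # Forward pass: evaluate each primitive operation and keep the intermediates.
    x3 = x1 + x2
    x4 = x3 ** 2
    x5 = x1 * x2
    x6 = math.sin(x5)
    x7 = x4 + x6
    x8 = math.log(x7)                # x8 = f

    # Reverse pass: propagate adjoints from the output back to the inputs.
    x8_bar = 1.0                     # df/dx8
    x7_bar = x8_bar / x7             # d log(x7)/dx7 = 1/x7
    x4_bar = x7_bar                  # x7 = x4 + x6
    x6_bar = x7_bar
    x3_bar = x4_bar * 2 * x3         # x4 = x3**2
    x5_bar = x6_bar * math.cos(x5)   # x6 = sin(x5)

    # x1 and x2 each feed two paths (via x3 and x5), so their adjoints are summed.
    x1_bar = x3_bar + x5_bar * x2
    x2_bar = x3_bar + x5_bar * x1
    return x8, x1_bar, x2_bar

print(f_and_grads(1.0, 2.0))         # value of f and its two partial derivatives
```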
Applications of AD
Automatic differentiation is a core component in modern computational systems that require efficient and
accurate derivatives. In particular, it powers nearly all deep learning frameworks such as:
- PyTorch — dynamic computational graphs with reverse-mode AD via autograd
- TensorFlow — eager and static (graph) execution, with reverse-mode AD via tf.GradientTape
- JAX — composable transformations such as grad, vmap, and jit, based on function tracing and XLA compilation
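For instance, a few lines of PyTorch reproduce the gradients derived analytically above (the sample inputs x1 = 1.0 and x2 = 2.0 are arbitrary illustrative values):

```python
import torch

# Inputs as leaf tensors that track gradients.
x1 = torch.tensor(1.0, requires_grad=True)
x2 = torch.tensor(2.0, requires_grad=True)

# The forward pass builds the computational graph on the fly.
f = torch.log((x1 + x2) ** 2 + torch.sin(x1 * x2))

# Reverse-mode AD: accumulates df/dx1 and df/dx2 into the .grad fields.
f.backward()
print(x1.grad, x2.grad)
```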