The Derivative of \(f:\mathbb{R}^n \rightarrow \mathbb{R}\)

Linear Approximations
Differentials
Example 1: \(f(x) = x^Tx\) where \(x \in \mathbb{R}^n\)
Example 2: \(f(x) = x^TAx\) where \(x \in \mathbb{R}^n\) and \(A \in \mathbb{R}^{n \times n}\)
Example 3: \(f(x) = \| x \|_2\) where \(x \in \mathbb{R}^n\)

Linear Approximations

Linear approximation is the process of approximating a function \(f(x)\) near a point \(x_o\) by a linear function; it simplifies complex functions locally. \[ L(x) = f(x_o) + f'(x_o)(x -x_o) \approx f(x) \tag{1} \] where \(f'(x_o) = \lim_{x \to x_o} \frac{f(x)-f(x_o)}{x-x_o}\) is the derivative of \(f\) at \(x_o\).
Equivalently, from equation (1), \[ f(x) - f(x_o) \approx f'(x_o)(x -x_o) \]
\(L(x)\) is called the linearization of \(f\) at \(x_o\). The graph of \(L\) is the tangent line at \((x_o, f(x_o))\). Linear approximations form the foundation of differentiation and provide a local linear model for \(f(x)\). This concept extends naturally to differentials.
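As a quick numerical sketch (the choice \(f(x) = \sin x\) and the helper name `linearize` are ours, purely for illustration), the linearization closely tracks \(f\) near \(x_o\) and drifts away from it farther out:

```python
import math

def linearize(f, dfdx, x0):
    """Build the linearization L(x) = f(x0) + f'(x0)(x - x0)."""
    return lambda x: f(x0) + dfdx(x0) * (x - x0)

# f(x) = sin(x) linearized at x0 = 0 gives L(x) = x.
L = linearize(math.sin, math.cos, 0.0)
print(L(0.1) - math.sin(0.1))  # small error near x0
print(L(1.5) - math.sin(1.5))  # larger error far from x0
```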

Differentials

Differentials describe infinitesimally small changes in quantities. If \(y = f(x)\), where \(f\) is a differentiable function, then the differential \(dx\) is an independent variable and the differential \(dy\) is defined as \[ dy = f'(x)dx. \] In practice, \(dx\) and \(dy\) are treated as arbitrarily small numbers. Also, if \(dx \neq 0\), then we recover the familiar derivative form: \[ \frac{dy}{dx} = f'(x), \] where the left side now represents a ratio of differentials.

The exact change in \(y\) corresponding to a change \(dx\) in \(x\) is: \[ \delta y = f(x + dx) -f(x). \] As \(dx \to 0\), the approximation improves: \[ dy = f'(x)dx \approx \delta y = f(x + dx) -f(x). \] More precisely, the relationship can be written as: \[ f(x + dx) - f(x) = f'(x)dx + o(dx), \] where \(o(dx)\) (little-\(o\) asymptotic notation) collects the higher-order terms that become negligible as \(dx \to 0\). This highlights that the differential \(dy = f'(x)dx\) serves as a linear approximation to \(\delta y\).
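A small numerical sketch of this relationship (the cubic \(f\) is an arbitrary choice for illustration): the remainder \(\delta y - dy\) is exactly the \(o(dx)\) term, and it shrinks faster than \(dx\) itself.

```python
def f(x):
    return x ** 3

def fprime(x):
    return 3 * x ** 2

x, dx = 2.0, 1e-4
dy = fprime(x) * dx            # linearized change f'(x) dx
delta_y = f(x + dx) - f(x)     # exact change
remainder = delta_y - dy       # this is the o(dx) term
print(remainder / dx)          # vanishes as dx -> 0
```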

More generally, the differential of \(f\) can be expressed, to first order, as: \[ df = f(x + dx) - f(x) = f'(x)dx, \] where \(df\) represents the linearized change in \(f(x)\) due to an infinitesimally small change in \(x\).

Here, \(df\) is the change in the output, \(dx\) is the change in the input, and most importantly, \(f'(x)\) acts as a linear operator that maps \(dx\) to \(df\).
The flexibility of differential notation extends naturally to linear algebra, where derivatives apply not only to scalars \(x \in \mathbb{R}\), but also to vectors \(\vec{x} \in \mathbb{R}^n\) and matrices \(X \in \mathbb{R}^{m \times n}\).

Example 1: \(f(x) = x^Tx\) where \(x \in \mathbb{R}^n\)

In this case, the input is the vector \(x\), and the output is the scalar \(x^Tx\). To compute the derivative of this function, we start with: \[ f(x) = x^Tx = \sum_{i=1}^n x_i^2 , \] where \(x_i\) is the \(i\)-th entry of the vector \(x\). Then the gradient of \(f\) is: \[ \nabla f = \begin{bmatrix} \frac{\partial f }{\partial x_1} \\ \frac{\partial f }{\partial x_2} \\ \vdots \\ \frac{\partial f }{\partial x_n} \end{bmatrix} = \begin{bmatrix} 2x_1 \\ 2x_2\\ \vdots \\ 2x_n \end{bmatrix} = 2x \] Note: Input: vector & Output: scalar \(\Longrightarrow\) First derivative: column vector (gradient).
Now, let's derive the same result using differential notation. Note: \(dx \in \mathbb{R}^n\).

By the product rule, and the commutativity of the vector inner product: \[ \begin{align*} d(x^Tx) &= (dx^T)x + x^T(dx) \\\\ &= x^Tdx + x^Tdx \\\\ &= 2x^Tdx. \end{align*} \] Note: \(dx^T = (dx)^T\) because \[\begin{align*} d(x^T) &= (x + dx)^T - x^T \\\\ &= x^T + (dx)^T - x^T \\\\ &= (dx)^T. \end{align*} \] Thus the gradient is \[ \nabla f = (2x^T)^T = 2x. \] Note: \(2x^T\) is a "row" vector, and to get the column vector \(\nabla f\), we need the transpose of \(2x^T\).
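The result \(\nabla f = 2x\) can be checked against finite differences (a NumPy sketch; the random test vector is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(5)

f = lambda v: v @ v          # f(x) = x^T x
grad = 2 * x                 # analytic gradient

# one-sided finite-difference approximation of each partial derivative
eps = 1e-6
fd = np.array([(f(x + eps * e) - f(x)) / eps for e in np.eye(5)])
print(np.max(np.abs(grad - fd)))  # small discrepancy
```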

Theorem 1 (Product rule): If \(g\) and \(h\) are differentiable and \(f(x) = g(x)h(x)\), then \[ df = (dg)h + g(dh). \tag{2} \] Quick derivation: \[ \begin{align*} df &= g(x+dx)h(x+dx) - g(x)h(x) \\\\ &= [g(x) +dg][h(x) + dh]-g(x)h(x) \\\\ &= g(x)h(x) + (dg)h + g(dh) + (dg)(dh) -g(x)h(x) \end{align*} \] Then \((dg)(dh)\) is negligible (second order), and we get (2).
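The derivation can be replayed numerically; the leftover \((dg)(dh)\) term is exactly the gap between the exact change in \(gh\) and the product-rule estimate (the functions below are chosen only for illustration):

```python
def g(x):
    return x ** 2

def h(x):
    return x ** 3

x, dx = 1.5, 1e-5
dg = g(x + dx) - g(x)
dh = h(x + dx) - h(x)

df_exact = g(x + dx) * h(x + dx) - g(x) * h(x)
df_rule = dg * h(x) + g(x) * dh     # (dg)h + g(dh)
print(df_exact - df_rule)           # ~ (dg)(dh), second order in dx
```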

Example 2: \(f(x) = x^TAx\) where \(x \in \mathbb{R}^n\) and \(A \in \mathbb{R}^{n \times n}\)

Let's use the differential representation: \[\begin{align*} df &= f(x + dx) -f(x) \\\\ &= (x + dx)^T A (x + dx) - x^T A x \end{align*} \] Expanding this expression, we get: \[ df = x^TAx + dx^T A x + x^TAdx + dx^T A dx - x^T A x. \] It is valid to ignore the higher-order term \(dx^T A dx\), which becomes negligible as \(dx \to 0\). Then \[ df = dx^TAx + x^TAdx. \] Since \(dx^TAx\) is a scalar, it equals its own transpose: \(dx^TAx = (dx^TAx)^T = x^TA^Tdx\). Then we get: \[ df = x^TA^Tdx + x^TAdx = x^T(A^T + A)dx. \] Here, \[ (A^T + A)^T = A + A^T = A^T + A, \] so \(A^T + A\) is symmetric, and transposing the row vector \(x^T(A^T + A)\) gives: \[ \nabla f = (A+A^T)x. \] If \(A\) is symmetric (\(x^TAx\) is then a quadratic form with a symmetric matrix), \(\nabla f = (A+A^T)x = (A+A)x = 2Ax\).
Also, Example 1 is the special case \(A = I\), the identity matrix.
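The gradient \((A + A^T)x\) can likewise be sanity-checked with finite differences (a NumPy sketch; \(A\) and \(x\) are random and \(A\) is deliberately non-symmetric):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 4
A = rng.standard_normal((n, n))   # a general (non-symmetric) matrix
x = rng.standard_normal(n)

f = lambda v: v @ A @ v           # f(x) = x^T A x
grad = (A + A.T) @ x              # analytic gradient

eps = 1e-6
fd = np.array([(f(x + eps * e) - f(x)) / eps for e in np.eye(n)])
print(np.max(np.abs(grad - fd)))  # small discrepancy
```

For a symmetric matrix `S = A + A.T`, the same formula collapses to `2 * S @ x`, matching the \(2Ax\) special case in the text.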

Example 3: \(f(x) = \| x \|_2\) where \(x \in \mathbb{R}^n\)

Now, it is simple to find the derivative of the \(L_2\) norm.
Let \(r = \| x \|\) and assume \(x \neq 0\); then: \[ \begin{align*} & r^2 = x^Tx \\\\ &\Longrightarrow 2r\,dr = 2x^Tdx \\\\ &\Longrightarrow dr = \frac{x^T}{r}dx = \frac{x^T}{\| x \|}dx \\\\ &\Longrightarrow \nabla f = \frac{x}{\| x \|}. \end{align*} \]
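A quick check of \(\nabla \|x\|_2 = x/\|x\|_2\) (note \(x \neq 0\) is required; the vector below is picked so the exact gradient is \((0.6, 0.8)\)):

```python
import numpy as np

x = np.array([3.0, 4.0])              # ||x|| = 5
grad = x / np.linalg.norm(x)          # analytic gradient: the unit vector x/||x||

eps = 1e-7
fd = np.array([
    (np.linalg.norm(x + eps * e) - np.linalg.norm(x)) / eps
    for e in np.eye(2)
])
print(grad, fd)
```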