Mean Value Theorem

Lagrange's Mean Value Theorem
Taylor's Theorem
Cauchy's Mean Value Theorem
Higher-dimensional MVT

Lagrange's Mean Value Theorem

Theorem 1: Lagrange's Mean Value Theorem Suppose a function \(f\) is continuous on \([a, b]\) and differentiable on \((a, b)\). Then there exists a number \(c \in (a, b)\) such that \[ f'(c) = \frac{f(b) - f(a)}{b - a} \tag{1} \]
Note: The special case of this theorem where \(f(a) = f(b)\), so that \(f'(c) = 0\), is called Rolle's Theorem. We omit its proof, but the proof of Lagrange's MVT can be outlined as follows:

Proof: Suppose a function \(f\) is continuous on \([a, b]\) and differentiable on \((a, b)\).
Consider the line connecting \((a, f(a))\) and \((b, f(b))\). The equation of the line can be written as \[ y - f(a) = \frac{f(b)-f(a)}{b -a} (x - a) \] Let \(g(x)\) be the vertical difference between \((x, f(x))\) and \((x, y)\). Then \[ g(x) = f(x) - f(a) - \frac{f(b)-f(a)}{b -a} (x -a) \tag{2} \] \(g\) is continuous on \([a, b]\) and differentiable on \((a, b)\) because \(g\) is the difference of \(f\) and a first-degree polynomial, both of which are continuous and differentiable on the same interval.
From Equation (2), \[ g'(x) = f'(x) - \frac{f(b)-f(a)}{b -a} \] Also, clearly, \(g(a) = g(b) = 0\). Thus, by Rolle's Theorem, \[ \exists c \in (a,b) \text{ such that } g'(c) = 0 \] This implies \[ g'(c) = f'(c) - \frac{f(b)-f(a)}{b -a} = 0 \] and so \[ f'(c) = \frac{f(b)-f(a)}{b -a} \]
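The theorem can be illustrated numerically. The sketch below uses an arbitrary example of my own choosing (\(f(x) = x^3\) on \([0, 2]\), not from the text) and finds the point \(c\) where the tangent slope equals the secant slope, by bisecting the sign change of \(g'(x) = f'(x) - \frac{f(b)-f(a)}{b-a}\):

```python
# Numerical illustration of Lagrange's MVT (sketch; f and [a, b] are
# arbitrary choices): f(x) = x**3 on [0, 2], secant slope = (8 - 0)/2 = 4.

def f(x):
    return x ** 3

def f_prime(x):
    return 3 * x ** 2

a, b = 0.0, 2.0
slope = (f(b) - f(a)) / (b - a)  # = 4

# g'(x) = f'(x) - slope goes from -4 at x = 0 to +8 at x = 2,
# so it has a root in (a, b); locate it by bisection.
lo, hi = a, b
for _ in range(60):
    mid = (lo + hi) / 2
    if f_prime(mid) - slope < 0:
        lo = mid
    else:
        hi = mid
c = (lo + hi) / 2
print(c)  # 2/sqrt(3) ≈ 1.1547, and f'(c) = slope
```

Here \(3c^2 = 4\) gives \(c = 2/\sqrt{3}\), which indeed lies in \((0, 2)\) as the theorem guarantees.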

Taylor's Theorem

Equivalently, Equation (1) can be written as \[ f(b) - f(a) = f'(c)(b - a) \] This is Taylor's Theorem on \([a, b]\) with \(n = 1\).

Theorem 2: Taylor's Theorem Let \(n \geq 1\) be an integer and let \(f: \mathbb{R} \to \mathbb{R}\) be \(n\) times differentiable at \(a \in \mathbb{R}\). Then there exists \(g: \mathbb{R} \to \mathbb{R}\) such that \[ f(x) = \sum_{k = 0}^{n-1} \frac{f^{(k)}(a)}{k !}(x - a)^k + g_n (x) (x - a)^n \] and \[ \lim_{x \to a} g_n(x) = 0. \]
Note: \(g_n (x) (x - a)^n \) represents the remainder. An alternative and commonly used form is the Lagrange form of the remainder: \[ R_n (x) = \frac{f^{(n)}(c)}{n!}(x-a)^n \text{ where } c \in (a, x) \] Another common representation of the remainder, particularly in asymptotic analysis, is: \[ R_n (x) = \frac{f^{(n)}(a)}{n!}(x-a)^n + o(| x -a |^n) \] where the little-o notation \(o(| x -a |^n)\) indicates that the term converges to 0 faster than \(| x -a |^n\) as \(x \to a\).
The polynomial appearing in Taylor's Theorem is called the n-th order Taylor polynomial: \[ \begin{align*} T_n (x) &= f(a) + \frac{f'(a)}{1!}(x -a) + \frac{f''(a)}{2!}(x -a)^2 + \cdots + \frac{f^{(n)}(a)}{n!}(x - a)^n \\\\ &= \sum_{k = 0}^{n} \frac{f^{(k)}(a)}{k !}(x - a)^k \end{align*} \] Also, as \(n \to \infty\), if the function \(f\) is infinitely differentiable and the remainder term \(R_n (x)\) tends to zero, the Taylor expansion becomes: \[ f(x) = \sum_{n = 0}^{\infty} \frac{f^{(n)}(a)}{n !}(x - a)^n \] This is the Taylor series of the function \(f\) centered at \(a\). The series converges to \(f\) within its radius of convergence. So, Taylor polynomials are used to approximate \(f(x)\). For example, the 1st order Taylor polynomial \[ T_1 (x) = f(a) + f'(a)(x-a) \] is equivalent to the linearization of \(f\) at \(a\), providing a first-order approximation.
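The approximation property is easy to observe in code. The sketch below (an illustrative choice of mine, not from the text) evaluates \(T_n\) for \(f = \exp\) centered at \(a = 0\), where \(f^{(k)}(0) = 1\) for every \(k\), and shows the error shrinking as \(n\) grows:

```python
import math

# Sketch: n-th order Taylor polynomial of exp centered at a = 0,
# T_n(x) = sum_{k=0}^{n} x**k / k!, evaluated at x = 0.5.

def taylor_exp(x, n):
    # f^(k)(0) = 1 for exp, so each term is x**k / k!
    return sum(x ** k / math.factorial(k) for k in range(n + 1))

x = 0.5
for n in (1, 3, 5):
    err = abs(math.exp(x) - taylor_exp(x, n))
    print(n, err)  # error decreases as n increases
```

The \(n = 1\) case is exactly the linearization \(T_1(x) = 1 + x\) mentioned above; higher orders tighten the approximation, consistent with the remainder \(R_n(x) \to 0\).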

Cauchy's Mean Value Theorem

Theorem 3: Cauchy's Mean Value Theorem Let both \(f\) and \(g\) be continuous on \([a,b]\) and differentiable on \((a,b)\). If \(\forall x \in (a, b), \, g'(x) \neq 0\), then \(\exists c \in (a, b)\) such that \[ \frac{f(b)-f(a)}{g(b)-g(a)} = \frac{f'(c)}{g'(c)}. \]
Note: Cauchy's Mean Value Theorem generalizes Lagrange's MVT (the case \(g(x) = x\)) and can be useful in advanced mathematical contexts, but is not commonly applied directly in machine learning.
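As with Lagrange's MVT, the conclusion can be checked numerically. The sketch below uses example functions of my own choosing (\(f(x) = x^2\), \(g(x) = x^3\) on \([1, 2]\)) and locates the point \(c\) where \(f'(c)/g'(c)\) equals the ratio of increments:

```python
# Numerical check of Cauchy's MVT (sketch; f, g, and [a, b] are
# arbitrary choices): f(x) = x**2, g(x) = x**3 on [1, 2].

def f(x):
    return x ** 2

def g(x):
    return x ** 3

a, b = 1.0, 2.0
ratio = (f(b) - f(a)) / (g(b) - g(a))  # = 3/7

def h(x):
    # f'(x)/g'(x) - ratio = 2/(3x) - 3/7, strictly decreasing on [1, 2]
    return (2 * x) / (3 * x ** 2) - ratio

lo, hi = a, b  # h(lo) > 0 > h(hi), so bisect
for _ in range(60):
    mid = (lo + hi) / 2
    if h(mid) > 0:
        lo = mid
    else:
        hi = mid
c = (lo + hi) / 2
print(c)  # 14/9 ≈ 1.5556
```

Solving \(2/(3c) = 3/7\) by hand gives \(c = 14/9 \in (1, 2)\), matching the bisection result.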

Higher-dimensional MVT

Before moving to optimization topics, it is important to discuss the higher-dimensional version of the MVT. This theorem is foundational in gradient descent algorithms, which rely on gradients to minimize loss functions in multidimensional spaces. Like its one-dimensional counterpart, the higher-dimensional MVT leads to higher-dimensional Taylor expansions.

Higher-Dimensional MVT Let \(f: X \to \mathbb{R}\) be a differentiable function, where \(X\) is an open subset of \(\mathbb{R}^n\).
For any points \(a, b \in X\) such that \( [a, b] \subset X\), \(\, \exists c \in (a,b) \text{ such that }\) \[ f(b) - f(a) = \nabla f(c) \cdot (b-a). \]
Note: \([a, b] = \{(1-t)a + tb \, | \, t \in [0, 1] \}\) and \((a, b) = \{(1-t)a + tb \, | \, t \in (0, 1) \}\)
Proof: Define \(h(t) = f((1-t)a + tb) = f(a + t(b-a))\) where \(t \in \mathbb{R}\).
By Theorem 1: Lagrange's Mean Value Theorem,
\(\exists \theta \in (0,1)\) such that \[ h'(\theta ) = \frac{h(1) - h(0)}{1 - 0} = f(b) - f(a). \] Also, by the chain rule, \[ h'(\theta ) = \nabla f(a + \theta (b -a)) \cdot (b - a). \] Here, let \(c = a + \theta (b-a)\), then we obtain \[ \nabla f(c) \cdot (b - a) = f(b) - f(a). \]
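The proof's construction translates directly into code. The sketch below uses an example of my own choosing (\(f(x, y) = x^2 + 3y^2\), \(a = (0,0)\), \(b = (1,1)\)): it restricts \(f\) to the segment and bisects for the \(\theta\) at which the directional derivative matches \(f(b) - f(a)\):

```python
# Sketch of the higher-dimensional MVT (example f, a, b are my choices):
# f(x, y) = x**2 + 3*y**2, a = (0, 0), b = (1, 1), so f(b) - f(a) = 4.

def f(x, y):
    return x ** 2 + 3 * y ** 2

def grad_f(x, y):
    return (2 * x, 6 * y)

a, b = (0.0, 0.0), (1.0, 1.0)
diff = f(*b) - f(*a)  # = 4

def directional(t):
    # grad f(a + t(b - a)) . (b - a), i.e. h'(t) from the proof
    c = (a[0] + t * (b[0] - a[0]), a[1] + t * (b[1] - a[1]))
    gx, gy = grad_f(*c)
    return gx * (b[0] - a[0]) + gy * (b[1] - a[1])

# directional(t) - diff = 8t - 4 changes sign on (0, 1); bisect it.
lo, hi = 0.0, 1.0
for _ in range(60):
    mid = (lo + hi) / 2
    if directional(mid) - diff < 0:
        lo = mid
    else:
        hi = mid
theta = (lo + hi) / 2
print(theta)  # 0.5, so c = (0.5, 0.5)
```

Here \(h(t) = 4t^2\), so \(h'(\theta) = 8\theta = 4\) gives \(\theta = 1/2\), in agreement with the bisection.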
Note: This theorem holds only for scalar-valued functions of a vector variable (vector-in, scalar-out); it does not extend directly to vector-valued \(f\).