The Derivative of Scalar Functions of Matrices


Derivative of the Frobenius norm

Derivatives implicitly rely on norms to measure the magnitude of changes in both the input \(dx\) and the output \(df\), ensuring a consistent comparison of their scales.

The Frobenius norm of a matrix \(X \in \mathbb{R}^{m \times n}\) is defined as: \[ f(X) = \| X \|_F = \sqrt{\text{tr }(X^TX)} \]
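As a quick numerical sanity check of this definition (a minimal sketch; the \(3 \times 5\) test matrix is arbitrary), the trace formula agrees with the entrywise sum of squares and with PyTorch's built-in Frobenius norm:

                import torch

                # sqrt(tr(X^T X)) vs. the entrywise definition of the Frobenius norm
                X = torch.randn(3, 5, dtype=torch.float64)
                via_trace = torch.sqrt(torch.trace(X.T @ X))
                via_entries = torch.sqrt((X ** 2).sum())
                print(torch.allclose(via_trace, via_entries))           # True
                print(torch.allclose(via_trace, torch.linalg.norm(X)))  # True (matrix default is the Frobenius norm)
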
Now, taking the differential \(df\):

First, by the chain rule: \[ df = \frac{1}{2\sqrt{\text{tr }(X^TX)}}d[\text{tr }(X^TX)] \]
Note: for any matrix \(A\), \[ \begin{align*} d(\text{tr }(A)) &= \text{tr }(A + dA) - \text{tr }(A) \\\\ &= \text{tr }(A) + \text{tr }(dA) - \text{tr }(A) \\\\ &= \text{tr }(dA) \end{align*} \]
Thus: \[ df = \frac{1}{2\sqrt{\text{tr }(X^TX)}}\text{tr }[d(X^TX)] \]
By the product rule: \[ \begin{align*} df &= \frac{1}{2\sqrt{\text{tr }(X^TX)}}\text{tr }[dX^TX + X^TdX]\\\\ &= \frac{1}{2\sqrt{\text{tr }(X^TX)}}[\text{tr }(dX^TX) + \text{tr }(X^TdX)] \end{align*} \]
Since \(\text{tr }(dX^TX) = \text{tr }((dX^TX)^T) = \text{tr }(X^TdX)\), \[ \begin{align*} df &= \frac{1}{2\sqrt{\text{tr }(X^TX)}}2\text{tr }(X^TdX)\\\\ &= \frac{1}{\sqrt{\text{tr }(X^TX)}}\text{tr }(X^TdX) \end{align*} \]
Here, \(\text{tr }(X^TdX)\) is the Frobenius inner product of \(X\) and \(dX\). Then: \[ df = \left\langle \frac{X}{\sqrt{\text{tr }(X^TX)}}, dX \right\rangle_F \tag{1} \]
Therefore, \[ \nabla f = \frac{X}{ \| X \|_F}. \]
Note: The expression in (1) is equivalent to \[ df = \text{tr }((\nabla f)^TdX) \tag{2} \]
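This gradient is easy to verify numerically. A minimal autograd sketch (the \(4 \times 3\) matrix shape is arbitrary) comparing \(X/\|X\|_F\) with the gradient PyTorch computes:

                import torch

                # Autograd check: the gradient of ||X||_F should equal X / ||X||_F
                X = torch.randn(4, 3, dtype=torch.float64, requires_grad=True)
                f = torch.sqrt(torch.trace(X.T @ X))  # f(X) = ||X||_F
                f.backward()
                print(torch.allclose(X.grad, (X / f).detach()))  # True (up to rounding)
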
The trace operator satisfies linearity and the cyclic property, making it a convenient way to express derivatives in terms of gradients.
For example, consider \(f(A) = x^TAy\), where \(A\) is an \(m \times n\) matrix, \(x \in \mathbb{R}^m\), and \(y \in \mathbb{R}^n\).
By the product rule, \[ df = x^TdAy \]
Since \(df\) is a scalar, taking the trace does not change its value: \[ df = \text{tr }(x^TdAy) \]
By the cyclic property of the trace: \[ df = \text{tr }(yx^TdA) \]
Therefore, comparing this with \(df = \text{tr }((\nabla f)^TdA)\), \[ \nabla f = (yx^T)^T = xy^T \]
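A similar autograd check (the sizes \(m = 4\), \(n = 3\) are arbitrary) that the gradient of \(x^TAy\) with respect to \(A\) is \(xy^T\):

                import torch

                # Autograd check: the gradient of f(A) = x^T A y with respect to A should equal x y^T
                m, n = 4, 3
                x = torch.randn(m, dtype=torch.float64)
                y = torch.randn(n, dtype=torch.float64)
                A = torch.randn(m, n, dtype=torch.float64, requires_grad=True)
                f = x @ A @ y
                f.backward()
                print(torch.allclose(A.grad, torch.outer(x, y)))  # True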

Derivative of the Determinant

The derivative of the determinant of a square matrix \(A \in \mathbb{R}^{n \times n}\) can be expressed using several equivalent forms.
Recall: \[ \text{adj }(A) = \text{cofactor }(A)^T = (\det A)A^{-1}. \] This implies: \[ \text{cofactor }(A) = \text{adj }(A)^T = (\det A)(A^{-1})^T. \]
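To make these identities concrete, here is a small check (the helper cofactor and the \(4 \times 4\) test matrix are illustrative only) that builds the cofactor matrix directly from signed minors and compares \(\text{adj }(A)\) with \((\det A)A^{-1}\):

                import torch

                # Build the cofactor matrix from signed minors and check adj(A) = (det A) A^{-1}
                def cofactor(A):
                    n = A.shape[0]
                    C = torch.zeros_like(A)
                    for i in range(n):
                        for j in range(n):
                            rows = [r for r in range(n) if r != i]
                            cols = [c for c in range(n) if c != j]
                            minor = A[rows][:, cols]  # delete row i and column j
                            C[i, j] = (-1) ** (i + j) * torch.det(minor)
                    return C

                A = torch.randn(4, 4, dtype=torch.float64)
                adj = cofactor(A).T
                print(torch.allclose(adj, torch.det(A) * torch.inverse(A)))  # True (up to rounding)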

By the cofactor expansion of the determinant along the \(i\)-th row of \(A\): \[ \det (A) = A_{i1}C_{i1} + A_{i2}C_{i2} + \cdots + A_{in}C_{in} \]
Thus, \(\frac{\partial \det A}{\partial A_{ij}} = C_{ij} \), and then: \[ \nabla (\det A) = \text{cofactor } (A) \]
Equivalently, using the expression (2): \[ \begin{align*} d (\det A) &= \text{tr }(\text{cofactor }(A)^T dA) \\\\ &= \text{tr }(\text{adj }(A) dA) \\\\ &= \text{tr }((\det A)A^{-1} dA) \tag{3} \end{align*} \]
Therefore, \[ \nabla (\det A) = \text{cofactor } (A) = \text{adj }(A)^T = (\det A)(A^{-1})^T \]
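A short autograd check of this gradient (a sketch with an arbitrary \(5 \times 5\) matrix):

                import torch

                # Autograd check: the gradient of det(A) should equal (det A) * (A^{-1})^T
                A = torch.randn(5, 5, dtype=torch.float64, requires_grad=True)
                d = torch.det(A)
                d.backward()
                print(torch.allclose(A.grad, (d * torch.inverse(A).T).detach()))  # True (up to rounding)
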
Since \(\det A\) is a scalar, the expression (3) can be written as: \[ d(\det A) = (\det A)\text{tr }(A^{-1} dA) \tag{4} \] Consider the scalar function \(p(x) = \det(xI - A)\), which is the characteristic polynomial of \(A\). In practice, when approximating eigenvalues using numerical methods, we often need to compute the derivative of \(p(x)\) at different values of \(x\).
Applying the expression (4), we have: \[ d(\det (xI-A)) = (\det (xI-A)) \text{tr } ((xI-A)^{-1} d(xI -A)) \] Since \(A\) is constant, \(d(xI - A) = (dx)I\), and \(dx\) is a scalar. Thus, \[ d(\det (xI-A)) = (\det (xI-A)) \text{tr } ((xI-A)^{-1})dx \]
Note: While the analytical approach provides a useful formula for the differential of many functions, in practice automatic differentiation (auto-diff) often offers a more stable and efficient way to compute gradients. Auto-diff lets us compute the derivative with respect to matrix parameters directly, without explicitly forming quantities such as \(A^{-1}\), which can introduce numerical instability in certain cases.

Here's an example comparing auto-diff in PyTorch with the analytical formula for computing the derivative of \(p(x)\):

                import torch

                # Random square matrix
                def generate_matrix(n):
                    return torch.randn(n, n, dtype=torch.float64, requires_grad=False)

                # Characteristic polynomial p(x) = det(xI - A)
                def p(x, A):
                    return torch.det(x * torch.eye(A.shape[0], dtype=A.dtype, device=A.device) - A)

                # Derivative of p(x) using auto-differentiation
                def dp_torch(x, A):
                    x_tensor = torch.tensor([x], requires_grad=True, dtype=A.dtype, device=A.device)
                    grad = torch.autograd.grad(p(x_tensor, A), x_tensor)[0]
                    return grad.item()

                # Analytical formula d(p(x)) = (det (xI-A))*tr((xI-A)^-1)dx 
                def dp(x, A):
                    return (
                        p(x, A).item() *
                        torch.trace(
                            torch.inverse(x * torch.eye(A.shape[0], dtype=A.dtype, device=A.device) - A)
                        ).item()
                    )
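
A quick comparison of the two functions (the matrix size and the evaluation point below are arbitrary); the two values should agree up to floating-point error:

                # Compare the two implementations at an arbitrary point
                A = generate_matrix(4)
                x0 = 1.5
                print(dp_torch(x0, A))  # derivative via auto-diff
                print(dp(x0, A))        # derivative via the analytical formula (4)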