Derivative of the Frobenius norm
Derivatives implicitly rely on norms to measure the magnitude of changes in both the input \(dx\)
and the output \(df\), ensuring a consistent comparison of their scales.
The Frobenius norm of a matrix \(X \in \mathbb{R}^{m \times n}\) is defined as:
\[
f(X) = \| X \|_F = \sqrt{\text{tr }(X^TX)}
\]
Now take the differential \(df\), assuming \(X \neq 0\) so that the square root is differentiable.
By the chain rule:
\[
df = \frac{1}{2\sqrt{\text{tr }(X^TX)}}d[\text{tr }(X^TX)]
\]
Note: for any matrix \(A\),
\[
\begin{align*}
d(\text{tr }(A)) &= \text{tr }(A + dA) - \text{tr }(A) \\\\
&= \text{tr }(A) + \text{tr }(dA) - \text{tr }(A) \\\\
&= \text{tr }(dA)
\end{align*}
\]
Thus:
\[
df = \frac{1}{2\sqrt{\text{tr }(X^TX)}}\text{tr }[d(X^TX)]
\]
By the product rule:
\[
\begin{align*}
df &= \frac{1}{2\sqrt{\text{tr }(X^TX)}}\text{tr }[dX^TX + X^TdX]\\\\
&= \frac{1}{2\sqrt{\text{tr }(X^TX)}}\left[\text{tr }(dX^TX) + \text{tr }(X^TdX)\right]
\end{align*}
\]
Since \(\text{tr }(dX^TX) = \text{tr }((dX^TX)^T) = \text{tr }(X^TdX)\),
\[
\begin{align*}
df &= \frac{1}{2\sqrt{\text{tr }(X^TX)}}2\text{tr }(X^TdX)\\\\
&= \frac{1}{\sqrt{\text{tr }(X^TX)}}\text{tr }(X^TdX)
\end{align*}
\]
Here, \(\text{tr }(X^TdX)\) is the Frobenius inner product \(\langle X, dX \rangle_F = \sum_{i,j} X_{ij}\,dX_{ij}\). Then:
\[
df = \left\langle \frac{X}{\sqrt{\text{tr }(X^TX)}}, dX \right\rangle_F \tag{1}
\]
Therefore,
\[
\nabla f = \frac{X}{ \| X \|_F}.
\]
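As a sanity check, the result can be compared against a finite-difference approximation. The following is a minimal NumPy sketch, not part of the derivation itself; the helper names `frobenius_norm`, `frobenius_grad`, and `numerical_grad`, the step size `eps`, and the random test matrix are all assumptions made for illustration.

```python
import numpy as np

def frobenius_norm(X):
    """f(X) = sqrt(tr(X^T X))."""
    return np.sqrt(np.trace(X.T @ X))

def frobenius_grad(X):
    """Analytic gradient X / ||X||_F (only valid for X != 0)."""
    return X / frobenius_norm(X)

def numerical_grad(f, X, eps=1e-6):
    """Central finite differences, perturbing one entry of X at a time."""
    G = np.zeros_like(X)
    for idx in np.ndindex(X.shape):
        E = np.zeros_like(X)
        E[idx] = eps
        G[idx] = (f(X + E) - f(X - E)) / (2 * eps)
    return G

rng = np.random.default_rng(0)
X = rng.standard_normal((3, 4))  # arbitrary nonzero test matrix

# Largest entrywise gap between the analytic and numerical gradients.
max_err = np.max(np.abs(frobenius_grad(X) - numerical_grad(frobenius_norm, X)))
print(max_err)
```

If the derivation is right, `max_err` should be close to floating-point noise. The check is not meaningful at \(X = 0\), where the norm is not differentiable.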
Note: The expression in (1) is equivalent to
\[
df = \text{tr }((\nabla f)^TdX) \tag{2}
\]
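The equivalence between the inner-product form and the trace form can be checked numerically. In this small sketch, `G` stands in for a gradient and `dX` for a perturbation; both are random placeholders chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
G = rng.standard_normal((3, 4))   # placeholder for the gradient
dX = rng.standard_normal((3, 4))  # placeholder for the perturbation

# Frobenius inner product: sum of entrywise products.
inner = np.sum(G * dX)

# Trace form: tr(G^T dX).
trace_form = np.trace(G.T @ dX)

print(np.isclose(inner, trace_form))
```

The entrywise sum avoids forming the \(n \times n\) product \(G^T dX\), so it is usually the cheaper way to evaluate the same quantity.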
The trace operator satisfies linearity and the cyclic property, making it a convenient way to express derivatives
in terms of gradients.
For example, consider \(f(A) = x^TAy\) where \(A\) is
an \(m \times n\) matrix, \(x \in \mathbb{R}^m\), and \(y \in \mathbb{R}^n\).
Since \(x\) and \(y\) are constant, the product rule gives
\[
df = x^TdAy
\]
Since \(df\) is a scalar, taking the trace does not change its value:
\[
df = \text{tr }(x^TdAy)
\]
By the cyclic property of the trace:
\[
df = \text{tr }(yx^TdA)
\]
Therefore, comparing this with \(df = \text{tr }((\nabla f)^TdA)\),
\[
\nabla f = (yx^T)^T = xy^T
\]
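This gradient can also be verified numerically. Because \(f\) is linear in \(A\), the change \(f(A + dA) - f(A)\) should match \(\text{tr }((\nabla f)^TdA)\) up to rounding error, even for a perturbation that is not small. The sketch below assumes arbitrary sizes `m = 3`, `n = 4` and random test data.

```python
import numpy as np

rng = np.random.default_rng(2)
m, n = 3, 4
x = rng.standard_normal(m)
y = rng.standard_normal(n)
A = rng.standard_normal((m, n))
dA = rng.standard_normal((m, n))  # arbitrary perturbation of A

def f(A):
    """f(A) = x^T A y, a scalar."""
    return x @ A @ y

# Claimed gradient: the outer product x y^T, with the same shape as A.
grad = np.outer(x, y)

lhs = f(A + dA) - f(A)        # actual change in f
rhs = np.trace(grad.T @ dA)   # tr((grad f)^T dA)

print(np.isclose(lhs, rhs))
```

Note that `np.outer(x, y)` has shape `(m, n)`, matching \(A\), which is a quick way to confirm the gradient is \(xy^T\) and not \(yx^T\).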