Trace
The trace of an \(n \times n\) matrix \(A\) is the sum of the diagonal entries of \(A\) and is
denoted by \(\text{tr }(A)\).
\[
\text{tr }(A) =\sum_{k=1}^n a_{kk}
\]
We list some properties of the trace.
1. \(\text{tr }(cA) = c (\text{tr }A) \quad , c \in \mathbb{R}\)
2. \(\text{tr }(A+B) = \text{tr }A + \text{tr }B \quad , B \in \mathbb{R}^{n \times n}\)
3. \(\text{tr }A^T = \text{tr }A \)
4. \(\text{tr }(A^TB) = \sum_{i=1}^m\sum_{j=1}^n a_{ij}b_{ij} \quad , A, B \in \mathbb{R}^{m \times n}\)
5. \(\text{tr }(AB) = \text{tr }(BA) \)
Note: \(A\) can be \(m \times n\) and \(B\) can be \(n \times m\). More generally,
\(\text{tr }(ABC) = \text{tr }(CAB) = \text{tr }(BCA) \) (the cyclic permutation property).
6. \(\text{tr }A =\sum_{i=1}^n \lambda_i \) where \(\lambda_1, \cdots, \lambda_n\) are the eigenvalues of \(A\)
For example, consider a quadratic form \(x^TAx \in \mathbb{R}\), where \(x\) is a vector in \(\mathbb{R}^n\).
Since \(x^TAx\) is a \(1 \times 1\) matrix, it equals its own trace, so by Property 5,
\[
\begin{align*}
x^TAx = \text{tr }(x^TAx) &= \text{tr }(xx^TA) \\\\
&= \text{tr }(Axx^T)
\end{align*}
\]
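As a quick numerical check, here is a minimal NumPy sketch (the dimension \(n = 4\) and the random seed are just illustrative choices) verifying that a quadratic form equals the trace of the cyclically permuted product.
```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))
x = rng.standard_normal((4, 1))   # column vector in R^4

# x^T A x is a 1x1 matrix, i.e. a scalar, so it equals its own trace;
# the cyclic permutation property then lets us rotate the factors.
quad = (x.T @ A @ x).item()
print(np.isclose(quad, np.trace(x @ x.T @ A)))                    # True
print(np.isclose(np.trace(A @ x @ x.T), np.trace(x @ x.T @ A)))   # True
```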
Property 4 is actually the definition of the inner product for matrices.
We call it the Frobenius inner product. For matrices \(A, B \in \mathbb{R}^{m \times n}\),
\[
\left\langle A, B \right\rangle_F = \text{tr }(A^TB) = \sum_{i=1}^m\sum_{j=1}^n a_{ij}b_{ij}
\]
Also, like the Euclidean (\(l_2\)) norm \(\|v\|_2 = \sqrt{v \cdot v} = \sqrt{v^Tv}\) of a vector \(v\), we define
the Frobenius norm of a "matrix" \(A\):
\[
\| A \|_F = \sqrt{\left\langle A, A \right\rangle_F} = \sqrt{\text{tr }(A^TA)}
\]
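The following sketch (NumPy, with arbitrary \(3 \times 5\) matrices as an assumption) checks that \(\text{tr }(A^TB)\) matches the entrywise sum \(\sum_{i,j} a_{ij}b_{ij}\) and that \(\sqrt{\text{tr }(A^TA)}\) agrees with NumPy's built-in Frobenius norm.
```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((3, 5))
B = rng.standard_normal((3, 5))

# Frobenius inner product: tr(A^T B) equals the entrywise sum of products.
print(np.isclose(np.trace(A.T @ B), np.sum(A * B)))        # True

# Frobenius norm: sqrt(tr(A^T A)) matches NumPy's 'fro' norm.
print(np.isclose(np.sqrt(np.trace(A.T @ A)),
                 np.linalg.norm(A, 'fro')))                # True
```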
Norms
Norms play a key role in normalization and regularization. Depending on the
application, we choose an appropriate norm. In machine learning, norms are foundational for the training stability of models.
Norms are used to scale data to a "common" range. This ensures that every feature contributes
"equally" to the training of the model and reduces sensitivity to the scale of each feature. This
normalization process is important in "distance (= norm)"-based models such as
support vector machines (SVMs) and \(k\)-nearest neighbors (k-NN).
Also, the regularization process uses norms to "penalize" model complexity, which avoids
overfitting and helps the model generalize to new data.
Another popular norm of a matrix is the nuclear norm (trace norm).
Given a matrix \(A \in \mathbb{R}^{m \times n}\), the nuclear norm of \(A\) is defined as the sum of its
singular values:
\[\| A\|_{\ast} = \sum_{i} \sigma_i \geq 0\]
Theorem 1:
Let \(A = U\Sigma V^T\) be the singular value decomposition of \(A\). Then
\[
\| A\|_{\ast} = \sum_{i} \sigma_i = \text{tr }(\Sigma) = \text{tr }(\sqrt{A^TA})
\]
Proof:
Suppose \(A = U\Sigma V^T\) is the singular value decomposition of \(A\).
Since \(A^TA = (V\Sigma U^T)(U\Sigma V^T) = V\Sigma^2 V^T\),
\[\sqrt{A^TA} = V\Sigma V^T\]
Then by the cyclic permutation property of trace,
\[
\begin{align*}
\text{tr }(\sqrt{A^TA}) = \text{tr }(V\Sigma V^T) &= \text{tr }(\Sigma V^T V) \\\\
&= \text{tr }(\Sigma ) \qquad (\text{since } V^TV = I)
\end{align*}
\]
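To make Theorem 1 concrete, here is a small NumPy check (the \(5 \times 3\) random matrix is only an illustrative assumption); \(\text{tr }(\sqrt{A^TA})\) is computed from the eigendecomposition of the symmetric matrix \(A^TA\), whose eigenvalues are \(\sigma_i^2\).
```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((5, 3))

# Nuclear norm as the sum of singular values.
nuc_svd = np.linalg.svd(A, compute_uv=False).sum()

# tr(sqrt(A^T A)): eigenvalues of A^T A are sigma_i^2, so take their square roots.
evals = np.linalg.eigvalsh(A.T @ A)
nuc_sqrt = np.sum(np.sqrt(np.clip(evals, 0.0, None)))

print(np.isclose(nuc_svd, nuc_sqrt))                  # True
print(np.isclose(nuc_svd, np.linalg.norm(A, 'nuc')))  # True (built-in nuclear norm)
```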
Also, the induced norm of \(A\) is defined as
\[\| A \|_p = \max_{x \neq 0} \frac{\| Ax \|_p}{\| x \|_p} = \max_{\| x \|_p =1} \| Ax \|_p \tag{1}\]
Typically, when \(p =2\), we call it the spectral norm of \(A\), which is just
the largest singular value of \(A\).
\[\| A \|_2 = \max_{i} \sigma_i = \sqrt{\lambda_{\max} (A^TA)}\]
where \(\lambda_{\max}(A^TA)\) is the largest eigenvalue of \(A^TA\).
The spectral norm is commonly used to control the Lipschitz continuity of
functions in neural networks in order to prevent vanishing or exploding
gradients.
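Here is a minimal numerical check of the spectral norm (again a sketch with an arbitrary random matrix), comparing the largest singular value, \(\sqrt{\lambda_{\max}(A^TA)}\), and NumPy's built-in 2-norm.
```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((4, 6))

sigma_max = np.linalg.svd(A, compute_uv=False).max()   # largest singular value
lam_max = np.linalg.eigvalsh(A.T @ A).max()            # largest eigenvalue of A^T A

print(np.isclose(sigma_max, np.sqrt(lam_max)))         # True
print(np.isclose(sigma_max, np.linalg.norm(A, 2)))     # True (spectral norm)
```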
\(\| x \|_p \) in (1) is called the \(p\)-norm of the vector \(x\). The Euclidean norm is
a specific case of \(p\)-norm where \(p =2\). In general,
\[\| x \|_p = \left(\sum_{i =1}^n |x_i|^p\right)^{\frac{1}{p}} \quad , p \geq 1\]
When \(p =1\), we call it the Manhattan norm, which is commonly used in grid-based
environments. Also, if we need sparsity in models such as Lasso regression, the Manhattan
norm is used as a regularizer.
Moreover, when \(p\) approaches \(\infty\), we define the maximum norm:
\[\| x \|_\infty = \lim_{p \to \infty} \| x \|_p = \max_{i} |x_i|\]
In numerical analysis, the maximum norm is used for "worst-case" error analysis. Similarly,
in machine learning, it is used when we want to minimize the worst error rather than the total
error across sample data.
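The sketch below (using an arbitrary illustrative vector) computes the \(p\)-norms discussed in this section directly from the definition and compares them with NumPy's built-ins, including the \(p \to \infty\) case.
```python
import numpy as np

x = np.array([3.0, -4.0, 1.0])

def p_norm(x, p):
    """p-norm (p >= 1) computed directly from the definition."""
    return np.sum(np.abs(x) ** p) ** (1.0 / p)

print(p_norm(x, 1), np.linalg.norm(x, 1))              # Manhattan norm: 8.0
print(p_norm(x, 2), np.linalg.norm(x, 2))              # Euclidean norm: sqrt(26)
print(np.max(np.abs(x)), np.linalg.norm(x, np.inf))    # maximum norm: 4.0
```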
Introduction to Metric Spaces
We learned about vector spaces and
norms, which measure the "distance" between points (vectors) in \(\mathbb{R}^n\). Although this concept may seem
"general" as it extends beyond the familiar 3D space, a vector space with a norm is actually a specific
case of a more general mathematical structure called a metric space. As an introduction to
mathematical analysis, we define the metric space \((M, d)\) as follows.
Metric Space \((M, d)\):
Suppose \(M\) is a set and \(d\) is a real-valued function defined on the Cartesian product
\(M \times M\). Then \(d\) is called a distance function, or
metric, on \(M\) if and only if,
\(\forall a, b, c \in M\),
- Positivity: \(d(a,b) \geq 0\), with equality iff \(a = b\)
- Symmetry: \(d(a,b) = d(b,a)\)
- Triangle inequality: \(d(a,b) \leq d(a,c) + d(c, b)\)
Often we just say "the metric space \(M\)" if the metric \(d\) is obvious.
Note: Suppose \(X\) and \(Y\) are sets. Then the Cartesian product of \(X\) and \(Y\)
is the set of "ordered" pairs
\[\{(x, y) | x \in X, \, y\in Y\}\]
Now you can see that a vector space with a norm is a metric space, with \(d(u,v) = \|u - v\|\). We call it
a normed vector space. In particular, in linear algebra, we have worked within
the framework of inner product spaces, whose metric is induced by the inner product:
\[d(u, v) = \|u-v\|= \sqrt{\langle u-v, u-v \rangle} \]
The most familiar example of an inner product space is the Euclidean space \(\mathbb{R}^n\).
In machine learning, metric spaces are used to define the distance, or similarity, between
data points.
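As a closing sketch (random points in \(\mathbb{R}^5\) are just an assumption for illustration), the metric induced by the Euclidean norm can be checked numerically against the three axioms above.
```python
import numpy as np

rng = np.random.default_rng(4)
a, b, c = rng.standard_normal((3, 5))   # three points in R^5

def d(u, v):
    """Metric induced by the Euclidean norm: d(u, v) = ||u - v||_2."""
    return np.linalg.norm(u - v)

print(d(a, b) >= 0 and np.isclose(d(a, a), 0.0))   # positivity
print(np.isclose(d(a, b), d(b, a)))                # symmetry
print(d(a, b) <= d(a, c) + d(c, b))                # triangle inequality
```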