Trace
The trace of an \(n \times n\) matrix \(A\) is the sum of the diagonal entries of \(A\) and is
denoted by \(\text{tr }(A)\).
\[
\text{tr }(A) =\sum_{k=1}^n a_{kk}
\]
We list some properties of the trace.
1. \(\text{tr }(cA) = c (\text{tr }A) \quad , c \in \mathbb{R}\)
2. \(\text{tr }(A+B) = \text{tr }A + \text{tr }B \quad , B \in \mathbb{R}^{n \times n}\)
3. \(\text{tr }A^T = \text{tr }A \)
4. \(\text{tr }(A^TB) = \sum_{i=1}^m\sum_{j=1}^n a_{ij}b_{ij} \quad , A, B \in \mathbb{R}^{m \times n}\)
5. \(\text{tr }(AB) = \text{tr }(BA) \)
Note: \(A\) can be \(m \times n\) and \(B\) can be \(n \times m\). More generally,
\(\text{tr }(ABC) = \text{tr }(CAB) = \text{tr }(BCA) \) (the cyclic permutation property).
6. \(\text{tr }A =\sum_{i=1}^n \lambda_i \) where \(\lambda_1, \cdots, \lambda_n\) are the eigenvalues of \(A\)
For example, consider a quadratic form \(x^TAx \in \mathbb{R}\), where \(x\) is a vector in \(\mathbb{R}^n\).
Since \(x^TAx\) is a \(1 \times 1\) matrix, it equals its own trace, so by Property 5,
\[
\begin{align*}
x^TAx = \text{tr }(x^TAx) &= \text{tr }(xx^TA) \\\\
&= \text{tr }(Axx^T)
\end{align*}
\]
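As a quick numerical check, here is a minimal NumPy sketch (the dimension \(n = 4\) and the random seed are just illustrative choices) verifying that a quadratic form equals the trace of the cyclically permuted product.
```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))
x = rng.standard_normal((4, 1))   # column vector in R^4

# x^T A x is a 1x1 matrix, i.e. a scalar, so it equals its own trace;
# the cyclic permutation property then lets us rotate the factors.
quad = (x.T @ A @ x).item()
print(np.isclose(quad, np.trace(x @ x.T @ A)))                    # True
print(np.isclose(np.trace(A @ x @ x.T), np.trace(x @ x.T @ A)))   # True
```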
Property 4 is actually the definition of the inner product for matrices.
We call it the Frobenius inner product. For matrices \(A, B \in \mathbb{R}^{m \times n}\),
\[
\left\langle A, B \right\rangle_F = \text{tr }(A^TB) = \sum_{i=1}^m\sum_{j=1}^n a_{ij}b_{ij}
\]
Also, like the Euclidean (\(l_2\)) norm \(\|v\|_2 = \sqrt{v \cdot v} = \sqrt{v^Tv}\) of a vector \(v\), we define
the Frobenius norm of a "matrix" \(A\):
\[
\| A \|_F = \sqrt{\left\langle A, A \right\rangle_F} = \sqrt{\text{tr }(A^TA)}
\]
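The following sketch (NumPy, with arbitrary \(3 \times 5\) matrices as an assumption) checks that \(\text{tr }(A^TB)\) matches the entrywise sum \(\sum_{i,j} a_{ij}b_{ij}\) and that \(\sqrt{\text{tr }(A^TA)}\) agrees with NumPy's built-in Frobenius norm.
```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((3, 5))
B = rng.standard_normal((3, 5))

# Frobenius inner product: tr(A^T B) equals the entrywise sum of products.
print(np.isclose(np.trace(A.T @ B), np.sum(A * B)))        # True

# Frobenius norm: sqrt(tr(A^T A)) matches NumPy's 'fro' norm.
print(np.isclose(np.sqrt(np.trace(A.T @ A)),
                 np.linalg.norm(A, 'fro')))                # True
```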
Norms
Norms play a key role in normalization and regularization. Depending on the
application, we choose an appropriate norm. In machine learning, norms are foundational for the training stability of models.
Norms are used to scale data to a "common" range. This ensures that every feature contributes
"equally" to the training of the model and reduces sensitivity to the scale of each feature. This
normalization process is important in "distance (= norm)"-based models such as
support vector machines (SVMs) and \(k\)-nearest neighbors (k-NN).
Also, the regularization process uses norms to "penalize" model complexity, which avoids
overfitting and helps the model generalize to new data.
Another popular norm of a matrix is the nuclear norm (trace norm).
Given a matrix \(A \in \mathbb{R}^{m \times n}\), the nuclear norm of \(A\) is defined as the sum of its
singular values:
\[\| A\|_{\ast} = \sum_{i} \sigma_i \geq 0\]
Theorem 1:
Let \(A = U\Sigma V^T\) be the singular value decomposition of \(A\). Then
\[
\| A\|_{\ast} = \sum_{i} \sigma_i = \text{tr }(\Sigma) = \text{tr }(\sqrt{A^TA})
\]
Proof:
Suppose \(A = U\Sigma V^T\) is the singular value decomposition of \(A\).
Since \(A^TA = (V\Sigma U^T)(U\Sigma V^T) = V\Sigma^2 V^T\),
\[\sqrt{A^TA} = V\Sigma V^T\]
Then by the cyclic permutation property of trace,
\[
\begin{align*}
\text{tr }(\sqrt{A^TA}) = \text{tr }(V\Sigma V^T) &= \text{tr }(\Sigma V^T V) \\\\
&= \text{tr }(\Sigma ) \qquad (\text{since } V^TV = I)
\end{align*}
\]
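To make Theorem 1 concrete, here is a small NumPy check (the \(5 \times 3\) random matrix is only an illustrative assumption); \(\text{tr }(\sqrt{A^TA})\) is computed from the eigendecomposition of the symmetric matrix \(A^TA\), whose eigenvalues are \(\sigma_i^2\).
```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((5, 3))

# Nuclear norm as the sum of singular values.
nuc_svd = np.linalg.svd(A, compute_uv=False).sum()

# tr(sqrt(A^T A)): eigenvalues of A^T A are sigma_i^2, so take their square roots.
evals = np.linalg.eigvalsh(A.T @ A)
nuc_sqrt = np.sum(np.sqrt(np.clip(evals, 0.0, None)))

print(np.isclose(nuc_svd, nuc_sqrt))                  # True
print(np.isclose(nuc_svd, np.linalg.norm(A, 'nuc')))  # True (built-in nuclear norm)
```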
Also, the induced norm of \(A\) is defined as
\[\| A \|_p = \max_{x \neq 0} \frac{\| Ax \|_p}{\| x \|_p} = \max_{\| x \|_p =1} \| Ax \|_p \tag{1}\]
Typically, when \(p =2\), we call it the spectral norm of \(A\), which is just
the largest singular value of \(A\).
\[\| A \|_2 = \max_{i} \sigma_i = \sqrt{\lambda_{\max} (A^TA)}\]
where \(\lambda_{\max}(A^TA)\) is the largest eigenvalue of \(A^TA\).
The spectral norm is commonly used to control the Lipschitz continuity of
functions in neural networks in order to prevent vanishing or exploding
gradients.
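Here is a minimal numerical check of the spectral norm (again a sketch with an arbitrary random matrix), comparing the largest singular value, \(\sqrt{\lambda_{\max}(A^TA)}\), and NumPy's built-in 2-norm.
```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((4, 6))

sigma_max = np.linalg.svd(A, compute_uv=False).max()   # largest singular value
lam_max = np.linalg.eigvalsh(A.T @ A).max()            # largest eigenvalue of A^T A

print(np.isclose(sigma_max, np.sqrt(lam_max)))         # True
print(np.isclose(sigma_max, np.linalg.norm(A, 2)))     # True (spectral norm)
```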
\(\| x \|_p \) in (1) is called the \(p\)-norm of the vector \(x\). The Euclidean norm is
a specific case of \(p\)-norm where \(p =2\). In general,
\[\| x \|_p = \left(\sum_{i =1}^n |x_i|^p\right)^{\frac{1}{p}} \quad , p \geq 1\]
When \(p =1\), we call it the Manhattan norm, which is commonly used in grid-based
environments. Also, if we need sparsity in models such as Lasso regression, the Manhattan
norm is used as a regularizer.
Moreover, when \(p\) approaches \(\infty\), we define the maximum norm:
\[\| x \|_\infty = \lim_{p \to \infty} \| x \|_p = \max_{i} |x_i|\]
In numerical analysis, the maximum norm is used for "worst-case" error analysis. Similarly,
in machine learning, it is used when we want to minimize the worst error rather than the total
error across sample data.
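The sketch below (using an arbitrary illustrative vector) computes the \(p\)-norms discussed in this section directly from the definition and compares them with NumPy's built-ins, including the \(p \to \infty\) case.
```python
import numpy as np

x = np.array([3.0, -4.0, 1.0])

def p_norm(x, p):
    """p-norm (p >= 1) computed directly from the definition."""
    return np.sum(np.abs(x) ** p) ** (1.0 / p)

print(p_norm(x, 1), np.linalg.norm(x, 1))              # Manhattan norm: 8.0
print(p_norm(x, 2), np.linalg.norm(x, 2))              # Euclidean norm: sqrt(26)
print(np.max(np.abs(x)), np.linalg.norm(x, np.inf))    # maximum norm: 4.0
```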
Introduction to Metric Spaces
We learned about vector spaces and
norms, which measure the "distance" between points (vectors) in \(\mathbb{R}^n\). Although this concept may seem
"general" as it extends beyond the familiar 3D space, a vector space with a norm is actually a specific
case of a more general mathematical structure called a metric space. As an introduction to
mathematical analysis, we define the metric space \((M, d)\) as follows.
Metric Space \((M, d)\):
Suppose \(M\) is a set and \(d\) is a real-valued function defined on the Cartesian product
\(M \times M\). Then \(d\) is called a distance function, or
metric, on \(M\) if and only if,
\(\forall a, b, c \in M\),
- Positivity: \(d(a,b) \geq 0\), with equality iff \(a = b\)
- Symmetry: \(d(a,b) = d(b,a)\)
- Triangle inequality: \(d(a,b) \leq d(a,c) + d(c, b)\)
Often we just say "the metric space \(M\)" if the metric \(d\) is obvious.
Note: Suppose \(X\) and \(Y\) are sets. Then the Cartesian product of \(X\) and \(Y\)
is the set of "ordered" pairs
\[\{(x, y) | x \in X, \, y\in Y\}\]
Now you can see that a vector space with a norm is a metric space, with \(d(u,v) = \|u - v\|\). We call it
a normed vector space. In particular, in linear algebra, we have worked within
the framework of inner product spaces, whose metric is induced by the inner product:
\[d(u, v) = \|u-v\|= \sqrt{\langle u-v, u-v \rangle} \]
The most familiar example of an inner product space is the Euclidean space \(\mathbb{R}^n\).
In machine learning, metric spaces are used to define the distance, or similarity, between
data points.
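As a closing sketch (random points in \(\mathbb{R}^5\) are just an assumption for illustration), the metric induced by the Euclidean norm can be checked numerically against the three axioms above.
```python
import numpy as np

rng = np.random.default_rng(4)
a, b, c = rng.standard_normal((3, 5))   # three points in R^5

def d(u, v):
    """Metric induced by the Euclidean norm: d(u, v) = ||u - v||_2."""
    return np.linalg.norm(u - v)

print(d(a, b) >= 0 and np.isclose(d(a, a), 0.0))   # positivity
print(np.isclose(d(a, b), d(b, a)))                # symmetry
print(d(a, b) <= d(a, c) + d(c, b))                # triangle inequality
```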