Covariance

Covariance Matrix
Principal Component Analysis (PCA)
PCA with Singular Value Decomposition (SVD)

Covariance Matrix

Starting from this section, we consider multivariate models. First of all, we need a way to measure how random variables depend on each other.

The covariance between two random variables \(X\) and \(Y\) is defined as \[ \text{Cov }[X, Y] = \mathbb{E}[(X - \mathbb{E}[X])(Y - \mathbb{E}[Y])]. \] Here, \((X - \mathbb{E}[X])\) and \((Y - \mathbb{E}[Y])\) are mean deviations, which represent how far the random variables \(X\) and \(Y\) deviate from their respective expected values (means). So, the covariance measures how these deviations vary together (joint variation).

Note: \(\text{Cov }[X, Y] = \mathbb{E}[XY]- \mathbb{E}[X]\mathbb{E}[Y]\). So, if \(X\) and \(Y\) are independent, \(\mathbb{E}[XY] = \mathbb{E}[X]\mathbb{E}[Y]\) and thus \(\text{Cov }[X, Y] = 0\). However, in general, the converse is NOT true: zero covariance only tells us there is no linear dependence. A positive (negative) covariance indicates a positive (negative) linear association.
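As a quick numerical illustration (a minimal NumPy sketch; the variables here are synthetic and only for demonstration), take \(Y = X^2\) with \(X\) standard normal: \(Y\) is completely determined by \(X\), yet their covariance is approximately zero because the dependence is not linear.

                    import numpy as np

                    rng = np.random.default_rng(0)

                    # X ~ N(0, 1) and Y = X^2 are clearly dependent, but have no linear association
                    x = rng.standard_normal(1_000_000)
                    y = x ** 2

                    # Cov[X, Y] = E[XY] - E[X]E[Y] is approximately 0
                    print(np.mean(x * y) - np.mean(x) * np.mean(y))
                    print(np.cov(x, y)[0, 1])   # same quantity via NumPy's unbiased estimator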

In practice, we often extend this idea to \(n\) random variables. Consider a vector \(x \in \mathbb{R}^n\) whose entries are the random variables \(X_1, \cdots, X_n\).
The population covariance matrix of the vector \(x\) is defined as \[ \begin{align*} \Sigma &= \text{Cov }[x] \\\\ &= \mathbb{E}[(x - \mathbb{E}[x])(x - \mathbb{E}[x])^T] \\\\ &= \begin{bmatrix} \text{Var }[X_1] & \text{Cov }[X_1, X_2] & \cdots & \text{Cov }[X_1, X_n] \\ \text{Cov }[X_2, X_1] & \text{Var }[X_2] & \cdots & \text{Cov }[X_2, X_n] \\ \vdots & \vdots & \ddots & \vdots \\ \text{Cov }[X_n, X_1] & \text{Cov }[X_n, X_2] & \cdots & \text{Var }[X_n] \end{bmatrix} \\\\ &= \mathbb{E }[xx^T] - \mu \mu^T \end{align*} \] where \(\mu = \mathbb{E}[x]\) is the mean vector. Each diagonal entry \(\Sigma_{ii}\) represents the variance of \(X_i\) because \[ \text{Cov }[X_i, X_i] = \mathbb{E}[(X_i - \mathbb{E}[X_i])^2] = \text{Var }[X_i]. \] Also, the total variance of \(\Sigma \) is defined as the trace of \(\Sigma \): \[ \text{tr } (\Sigma ) = \sum_{i=1}^n \text{Var }[X_i]. \]
The covariance matrix is symmetric because the covariance itself is symmetric: \[ \Sigma_{ij} = \text{Cov }[X_i, X_j] = \text{Cov }[X_j, X_i] = \Sigma_{ji}. \] (Or, for any matrix \(A\), \(\, AA^T\) is symmetric because \((AA^T)^T = (A^T)^TA^T = AA^T \).)
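As a small sanity check (a minimal NumPy sketch with synthetic data), the sample covariance matrix built from mean-centered data is indeed symmetric and matches NumPy's built-in estimator:

                    import numpy as np

                    rng = np.random.default_rng(0)
                    X = rng.standard_normal((500, 4))      # 500 observations of a 4-dimensional random vector

                    # Sample covariance matrix from mean-centered data: S = Xc^T Xc / (m - 1)
                    Xc = X - X.mean(axis=0)
                    S = Xc.T @ Xc / (X.shape[0] - 1)

                    print(np.allclose(S, S.T))                       # True: the covariance matrix is symmetric
                    print(np.allclose(S, np.cov(X, rowvar=False)))   # True: matches np.cov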

Since \(\Sigma\) is symmetric, it is always orthogonally diagonalizable: \[ \Sigma = P D P^T \] where \(P\) is an orthogonal matrix whose columns are unit eigenvectors of \(\Sigma\) and \(D\) is a diagonal matrix whose diagonal entries are eigenvalues of \(\Sigma\) corresponding to its eigenvectors.

Note: The total variance of \(\Sigma\) is equal to the sum of its eigenvalues (only the total variance is preserved, not the individual variances!): \[ \text{tr } (\Sigma ) = \sum_{i=1}^n \text{Var }[X_i] = \text{tr } (D) = \sum_{i=1}^n \lambda_i \]
Moreover, the covariance matrix is always positive semi-definite:
For any vector \(v \in \mathbb{R}^n\), \[ \begin{align*} v^T \Sigma v &= v^T\mathbb{E}[(x - \mathbb{E}[x])(x - \mathbb{E}[x])^T] v \\\\ &= \mathbb{E}[v^T(x - \mathbb{E}[x])(x - \mathbb{E}[x])^T v] \\\\ &= \mathbb{E}[(v^T(x - \mathbb{E}[x]))^2] \geq 0 \end{align*} \] Thus, the diagonal entries of \(D\) are non-negative. In other words, the eigenvalues of \(\, \Sigma\) always satisfy: \[ \lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_n \geq 0. \]
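Both facts (the total variance equals the sum of the eigenvalues, and the eigenvalues are non-negative) are easy to confirm numerically; here is a minimal sketch that reuses a synthetic sample covariance matrix S as above:

                    import numpy as np

                    rng = np.random.default_rng(1)
                    X = rng.standard_normal((500, 4))
                    Xc = X - X.mean(axis=0)
                    S = Xc.T @ Xc / (X.shape[0] - 1)       # sample covariance matrix (symmetric, PSD)

                    # Orthogonal diagonalization S = P D P^T (eigh is designed for symmetric matrices)
                    eigvals, P = np.linalg.eigh(S)

                    print(np.allclose(np.trace(S), eigvals.sum()))      # total variance = sum of eigenvalues
                    print(np.all(eigvals >= -1e-12))                    # non-negative up to round-off
                    print(np.allclose(P @ np.diag(eigvals) @ P.T, S))   # P D P^T reconstructs S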

SideNote: \(\mathbb{E }[xx^T]\) is called the autocorrelation matrix denoted as \(R_{xx}\): \[ \begin{align*} R_{xx} &= \mathbb{E }[xx^T] \\\\ &= \begin{bmatrix} \mathbb{E }[X_1^2] & \mathbb{E }[X_1 X_2] & \cdots & \mathbb{E }[X_1 X_n] \\ \mathbb{E }[X_2 X_1] & \mathbb{E }[X_2^2] & \cdots & \mathbb{E }[X_2 X_n] \\ \vdots & \vdots & \ddots & \vdots \\ \mathbb{E }[X_n X_1] & \mathbb{E }[X_n X_2] & \cdots & \mathbb{E }[X_n^2] \end{bmatrix} \\\\ \end{align*} \]
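The identity \(\Sigma = R_{xx} - \mu\mu^T\) from above can also be checked directly on a sample (a minimal sketch; sample averages stand in for the expectations):

                    import numpy as np

                    rng = np.random.default_rng(2)
                    X = rng.standard_normal((100_000, 3)) + np.array([1.0, -2.0, 0.5])   # non-zero mean

                    mu = X.mean(axis=0)
                    R = X.T @ X / X.shape[0]                     # sample autocorrelation matrix E[x x^T]
                    Sigma = np.cov(X, rowvar=False, bias=True)   # covariance estimate normalized by 1/m

                    print(np.allclose(Sigma, R - np.outer(mu, mu)))   # Sigma = R_xx - mu mu^T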

Principal Component Analysis (PCA)

The orthogonal diagonalization of the covariance matrix \(\Sigma\) is important in statistics and machine learning for analyzing data and reducing dimensionality: \[ \Sigma = P D P^T \] Here, the column vectors of \(P\) (unit eigenvectors of \(\Sigma \)) are called the principal components (PCs) of the data, and the diagonal entries of \(D\) (eigenvalues of \(\Sigma \)) represent the variances along these principal components. Each eigenvalue indicates "how much" of the total variance of \(\Sigma \) is captured by its corresponding principal component.

The first principal component (PC1) corresponds to the largest eigenvalue and represents the direction in the data space along which the data varies the most (the direction that maximizes the variance of the data). The second principal component (PC2) captures the next largest variance and is orthogonal to PC1. Similarly, each subsequent PC captures less variance while remaining orthogonal to all previous PCs.

For example, suppose you have an observed data matrix \(X \in \mathbb{R}^{m \times 10}\), which contains \(m\) data points, each represented as a vector \(x_i \in \mathbb{R}^{10} \). Computing the sample covariance matrix and diagonalizing it: \[ S = \frac{1}{m-1}(X-\bar{X})^T (X-\bar{X}) = PDP^T \] you examine the eigenvalues to determine how much variance each principal component captures relative to the total variance of \(S\).
Let's say PC1: 40%, PC2: 30%, and PC3: 25%. Then, you project the data onto the subspace spanned by the 3 most significant principal components. Even though you discard the remaining 7 dimensions (noise dimensions), you still retain 95% of the variance and thus the most significant patterns (trends) in the data. \[ x_i \in \mathbb{R}^{10} \to z_i \in \mathbb{R}^3 \] The vector \(z_i\) is known as the latent vector.
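The projection itself is just a matrix multiplication with the first \(k\) principal components; here is a minimal sketch, assuming a mean-centered data matrix and an eigenvector matrix whose columns are sorted by decreasing eigenvalue (the function and argument names are illustrative):

                    import numpy as np

                    def project_onto_pcs(X_centered, P_sorted, k):
                        # X_centered : (m, n) mean-centered data matrix
                        # P_sorted   : (n, n) eigenvectors of the covariance matrix, columns sorted
                        #              by decreasing eigenvalue
                        # k          : number of principal components to keep
                        # Returns the latent vectors z_i as the rows of an (m, k) matrix.
                        return X_centered @ P_sorted[:, :k]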

This dimensionality reduction process is called principal component analysis (PCA). By reducing dimensions, PCA enables efficient analysis of large datasets while retaining their most meaningful structure. PCA is widely used in machine learning, image processing, and exploratory data analysis for dimensionality reduction and noise filtering.

PCA with Singular Value Decomposition (SVD)

In practice, the singular value decomposition plays an important role in PCA due to its mathematical properties and computational advantages.

Given a data matrix in mean-deviation form \(X \in \mathbb{R}^{m \times n}\), where \(m\) is the number of data points and \(n\) is the number of features, the matrix \(\frac{1}{m-1}X^TX\) is the sample covariance matrix; \(X^TX\) itself is proportional to it, so the two share the same eigenvectors.

Note: in PCA, it is common to use \(X^TX \in \mathbb{R}^{n \times n}\) rather than \(XX^T \in \mathbb{R}^{m \times m}\) because we are typically more interested in understanding the relationships among the \(n\) "features" than among \(m\) data points.

Using the SVD of \(X\), we compute \(X^TX\) : \[ \begin{align*} X^TX &= (U\Sigma V^T)^T(U\Sigma V^T) \\\\ &= V\Sigma^TU^TU\Sigma V^T \\\\ &= V\Sigma^2V^T \end{align*} \] (Here \(\Sigma\) denotes the diagonal matrix of singular values, not the covariance matrix.) This result is equivalent to the eigendecomposition \(X^TX = PDP^T\): \(\Sigma^2\) contains the eigenvalues \(\lambda_i\) of \(X^TX\), with \(\sigma_i ^2 = \lambda_i\), and \(V\) (the right singular vectors) corresponds to the eigenvectors (principal components) of \(X^TX\).
Thus, for the sample covariance matrix \(S = \frac{1}{m-1}X^TX\), the eigenvalues are \(\frac{\sigma_i^2}{m-1}\): the scaled squared singular values represent the variances along the principal components, and their square roots the standard deviations.
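A quick numerical check of this correspondence (a minimal sketch with synthetic mean-centered data): the eigenvalues of the sample covariance matrix coincide with the squared singular values of \(X\) divided by \(m-1\).

                    import numpy as np

                    rng = np.random.default_rng(3)
                    X = rng.standard_normal((200, 5))
                    X = X - X.mean(axis=0)                       # mean-deviation form

                    S = X.T @ X / (X.shape[0] - 1)               # sample covariance matrix
                    eigvals = np.sort(np.linalg.eigvalsh(S))[::-1]   # descending order

                    sing = np.linalg.svd(X, compute_uv=False)    # singular values (already descending)
                    print(np.allclose(eigvals, sing ** 2 / (X.shape[0] - 1)))   # True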

SVD offers several numerical advantages for PCA. For high-dimensional data, explicitly computing \(X^TX\) can be expensive or memory-intensive, but SVD avoids forming \(X^TX\) by working directly with \(X\), which gives direct access to the principal components and variances.

                    import numpy as np

                    # Random data matrix (m data points with n features)
                    def generate_data(m, n):
                        data = np.random.randn(m, n) 

                        # Make some correlations 
                        data[:, 2] = 0.7 * data[:, 0] + 0.5 * data[:, 2]
                        data[:, 3] = 0.2 * data[:, 0] + 0.5 * data[:, 1] 
                        data[:, 4] = -0.3 * data[:, 1] + 0.2 * data[:, 2]
                        data[:, 5] = 0.4 * data[:, 0] + 0.1 * data[:, 1]
                        data[:, 6] = 0.8 * data[:, 3] - 0.3 * data[:, 2]

                        # Mean-deviation form 
                        return (data - np.mean(data, axis=0) )

                    # PCA with covariance matrix (as reference)
                    def pca_via_covariance(data):
                        
                        # "Sample" covariance matrix 
                        # Note: Dividing by (m-1) provides "unbiased" estimate of the population covariance. 
                        cov_matrix = np.dot(data.T, data) / (data.shape[0] - 1)

                        # Eigendecomposition 
                        # np.linalg.eigh() is designed for symmetric (real, orthogonally diagonalizable) matrices and is preferable to np.linalg.eig() here
                        eigvals, eigvecs = np.linalg.eigh(cov_matrix) 

                        # Sort eigenvalues & eigenvectors in descending order
                        idx = np.argsort(eigvals)[::-1] 
                        eigvals = eigvals[idx]
                        eigvecs = eigvecs[:, idx] 

                        # Each variance vs total variance
                        ratio = eigvals / np.sum(eigvals)

                        return eigvals, eigvecs, ratio

                    # PCA with SVD (we use this function)
                    def pca_with_svd(data):
                        
                        # Singular Value Decomposition (we don't need the matrix U: use "_" )
                        _, S, vt = np.linalg.svd(data, full_matrices=False)
                        
                        # Get eigenvalues via singular values: lambda_i = (S_i)^2 / (m - 1)
                        eigvals = (S ** 2) / (data.shape[0] - 1)
                        
                        # Each variance vs total variance
                        ratio = eigvals / np.sum(eigvals)
                        
                        return eigvals, vt.T, ratio 

                    # Determine the number of PCs needed to retain a fraction t of the total variance
                    def threshold(var_ratio, t):
                        
                        # Compute cumulative variance
                        cum_variance = np.cumsum(var_ratio)
                        
                        # Find the number of components needed to reach the variance-retention threshold t
                        num_pcs = np.argmax(cum_variance >= t) + 1
                        return num_pcs

                    # Dimension of data 
                    m = 1000000 
                    n = 10 

                    # Run PCA
                    eigvals, eigvecs, ratio = pca_with_svd(generate_data(m, n))

                    # Threshold for variance retention
                    t = 0.95
                    num_pcs = threshold(ratio, t)

                    # Print results
                    print("Eigenvalues:")
                    for i, val in enumerate(eigvals):
                        print(f"  Lambda {i + 1}: {val:.6f}")

                    print("\nExplained Variance Ratio (%):")
                    for i, var in enumerate(ratio):
                        print(f"  PC{i + 1}: {var * 100:.2f}%")

                    print(f"\nTo retain {t * 100:.0f}% of variance, use the first {num_pcs} PCs.")