Correlation


Cross-Covariance

Often we work with multiple datasets, or different sets of variables, that are related but not directly comparable in structure. In such cases, the cross-covariance measures the covariance between pairs of variables drawn from two different datasets.

For example, consider two different datasets \(A \in \mathbb{R}^{m \times n_1}\) and \(B \in \mathbb{R}^{m \times n_2}\). We can compute the sample cross-covariance matrix: \[ K_{AB} = \frac{1}{m-1}(A-\bar{A})^T(B-\bar{B}) \] (Both \(A\) and \(B\) must have the same number of data points.)
Note: \(K_{AA}\) is called the auto-covariance matrix, which is the same as the covariance matrix that we discussed in Part 5.
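To make the formula concrete, here is a minimal NumPy sketch (the helper name `cross_covariance` and the random test data are our own illustrative choices, not from the original):

```python
import numpy as np

def cross_covariance(A, B):
    """Sample cross-covariance matrix K_AB of two datasets sharing m rows."""
    if A.shape[0] != B.shape[0]:
        raise ValueError("A and B must have the same number of data points")
    m = A.shape[0]
    A_c = A - A.mean(axis=0)  # center columns: A - A_bar
    B_c = B - B.mean(axis=0)  # center columns: B - B_bar
    return A_c.T @ B_c / (m - 1)

rng = np.random.default_rng(0)
A = rng.normal(size=(100, 3))   # m = 100 shared observations, n1 = 3
B = rng.normal(size=(100, 2))   # same m, n2 = 2

K_AB = cross_covariance(A, B)   # shape (3, 2)
K_AA = cross_covariance(A, A)   # auto-covariance matrix
assert np.allclose(K_AA, np.cov(A, rowvar=False))  # matches NumPy's covariance
```

As the final assertion shows, `cross_covariance(A, A)` reproduces the ordinary covariance matrix, consistent with the note above.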

Correlation

The magnitude of the covariance between two random variables \(X\) and \(Y\) depends on the units of the variables, so we standardize the covariance by the standard deviations of the variables: \[ \sigma_X = \sqrt{\mathbb{E}[(X - \mu_X)^2]} \quad \text{and} \quad \sigma_Y = \sqrt{\mathbb{E}[(Y - \mu_Y)^2]}. \] We then obtain the population correlation coefficient between \(X\) and \(Y\), which quantifies the strength and direction of their linear relationship \(Y = aX + b\): \[ \begin{align*} \rho_{X, Y} = \text{Corr }[X, Y] &= \frac{\text{Cov }[X, Y]}{\sigma_X \sigma_Y} \\\\ &= \frac{\mathbb{E}[(X - \mu_X)(Y-\mu_Y)]}{\sigma_X \sigma_Y}. \end{align*} \] Similarly, the sample correlation coefficient is defined as: \[ r_{xy} = \frac{1}{(n-1)s_x s_y}\sum_{i =1}^n (x_i - \bar{x})(y_i - \bar{y}) \] where \(\bar{x},\, \bar{y}\) are the sample means and \(s_x,\, s_y\) are the corrected sample standard deviations.
Note: the corrected sample standard deviation is an unbiased estimator of the population standard deviation: \[ s_x = \sqrt{\frac{1}{n-1}\sum_{i=1}^n (x_i - \bar{x})^2}. \]
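As a quick sanity check, a direct implementation of \(r_{xy}\) agrees with NumPy's built-in estimate (the function name `sample_correlation` and the synthetic data below are our own choices for illustration):

```python
import numpy as np

def sample_correlation(x, y):
    """Sample correlation r_xy built from corrected (n - 1) standard deviations."""
    n = len(x)
    x_bar, y_bar = x.mean(), y.mean()
    s_x = np.sqrt(np.sum((x - x_bar) ** 2) / (n - 1))  # corrected sample std
    s_y = np.sqrt(np.sum((y - y_bar) ** 2) / (n - 1))
    return np.sum((x - x_bar) * (y - y_bar)) / ((n - 1) * s_x * s_y)

rng = np.random.default_rng(1)
x = rng.normal(size=200)
y = 2.0 * x + rng.normal(scale=0.5, size=200)  # noisy positive linear trend

r = sample_correlation(x, y)
assert np.isclose(r, np.corrcoef(x, y)[0, 1])  # agrees with NumPy's estimate
print(r)  # close to 1: strong positive linear relationship
```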

Theorem 1: Boundedness of Correlation Coefficient \[ -1 \leq \rho \leq 1 \] Note:
\(\rho = 1\) indicates a perfect positive linear relationship.
\(\rho = -1\) indicates a perfect negative linear relationship.
\(\rho = 0\) indicates no linear relationship.
Proof: We use the Cauchy-Schwarz inequality for random variables: \[ \mathbb{E}[XY]^2 \leq \mathbb{E}[X^2]\, \mathbb{E}[Y^2] \] Note: \(\langle X, Y \rangle = \mathbb{E}[XY]\) defines an inner product on the space of square-integrable random variables, which is why the Cauchy-Schwarz inequality applies.

Substitute the standardized variables \(\frac{X - \mathbb{E}[X]}{\sigma_X}\) and \(\frac{Y - \mathbb{E}[Y]}{\sigma_Y}\). The left-hand side becomes \(\rho^2\), while each expectation on the right is the variance of a standardized variable, which equals \(1\): \[ \begin{align*} & \mathbb{E}\left[\frac{(X - \mathbb{E}[X])}{\sigma_X}\cdot \frac{(Y - \mathbb{E}[Y])}{\sigma_Y}\right]^2 \leq \mathbb{E}\left[\left(\frac{X - \mathbb{E}[X]}{\sigma_X}\right)^2\right] \, \mathbb{E}\left[\left(\frac{Y - \mathbb{E}[Y]}{\sigma_Y}\right)^2\right] \\\\ &\Longrightarrow \rho^2 \leq 1 \cdot 1 \\\\ &\Longrightarrow -1 \leq \rho \leq 1. \qquad \blacksquare \end{align*} \]
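The bound is also easy to verify empirically; here is a small sketch (the seeds and sample sizes are arbitrary choices on our part):

```python
import numpy as np

# Empirical check of Theorem 1: sample correlations always land in [-1, 1],
# and an exact linear relationship y = a*x + b attains the endpoints.
rng = np.random.default_rng(2)
for _ in range(1_000):
    x = rng.normal(size=50)
    y = rng.normal(size=50)
    assert -1.0 <= np.corrcoef(x, y)[0, 1] <= 1.0

x = rng.normal(size=50)
print(np.corrcoef(x, 3.0 * x + 1.0)[0, 1])   # 1 up to float error (a > 0)
print(np.corrcoef(x, -3.0 * x + 1.0)[0, 1])  # -1 up to float error (a < 0)
```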

The correlation matrix is essentially a standardized version of the covariance matrix, with every entry normalized to a value between \(-1\) and \(1\). This standardization is particularly useful in applications like Principal Component Analysis (PCA), where features with significantly different scales can lead to biased results. By standardizing, we ensure that all features are given fair weighting.

For a random vector \(x \in \mathbb{R}^n\), the correlation matrix is defined as: \[ \begin{align*} R &= \text{Corr }[x] \\\\ &= \begin{bmatrix} 1 & \frac{\mathbb{E}[(X_1-\mu_1)(X_2-\mu_2)]}{\sigma_1 \sigma_2} & \cdots & \frac{\mathbb{E}[(X_1-\mu_1)(X_n-\mu_n)]}{\sigma_1 \sigma_n} \\ \frac{\mathbb{E}[(X_2-\mu_2)(X_1-\mu_1)]}{\sigma_2 \sigma_1} & 1 & \cdots & \frac{\mathbb{E}[(X_2-\mu_2)(X_n-\mu_n)]}{\sigma_2 \sigma_n} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\mathbb{E}[(X_n-\mu_n)(X_1-\mu_1)]}{\sigma_n \sigma_1} & \frac{\mathbb{E}[(X_n-\mu_n)(X_2-\mu_2)]}{\sigma_n \sigma_2} & \cdots & 1 \end{bmatrix} \\\\ &= \begin{bmatrix} 1 & \text{Corr }[X_1, X_2] & \cdots & \text{Corr }[X_1, X_n] \\ \text{Corr }[X_2, X_1] & 1 & \cdots & \text{Corr }[X_2, X_n] \\ \vdots & \vdots & \ddots & \vdots \\ \text{Corr }[X_n, X_1] & \text{Corr }[X_n, X_2] & \cdots & 1 \end{bmatrix} \\\\ \end{align*} \] where \(\mu_i = \mathbb{E}[X_i]\) is the mean and \(\sigma_i = \sqrt{\text{Var }[X_i]}\) is the standard deviation of \(X_i\).
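The definition can be realized entry by entry; a minimal sketch (the data matrix `X` and its dimensions are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(500, 3)) @ rng.normal(size=(3, 3))  # correlated columns

# Entrywise construction: R[i, j] = Corr[X_i, X_j].
n = X.shape[1]
R = np.empty((n, n))
for i in range(n):
    for j in range(n):
        R[i, j] = np.corrcoef(X[:, i], X[:, j])[0, 1]

assert np.allclose(np.diag(R), 1.0)  # Corr[X_i, X_i] = 1 on the diagonal
assert np.allclose(R, R.T)           # symmetric, like the covariance matrix
```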

The correlation matrix \(R\) can be derived from the auto-covariance matrix \(K_{xx}\) alone: \[ \begin{align*} K_{xx} &= \mathbb{E}[(x - \mathbb{E}[x])(x - \mathbb{E}[x])^T] \\\\ &= \begin{bmatrix} \text{Var }[X_1] & \text{Cov }[X_1, X_2] & \cdots & \text{Cov }[X_1, X_n] \\ \text{Cov }[X_2, X_1] & \text{Var }[X_2] & \cdots & \text{Cov }[X_2, X_n] \\ \vdots & \vdots & \ddots & \vdots \\ \text{Cov }[X_n, X_1] & \text{Cov }[X_n, X_2] & \cdots & \text{Var }[X_n] \end{bmatrix}. \end{align*} \] Here, \(\text{diag }(K_{xx})\) denotes the diagonal matrix whose entries are the variances of the individual variables, and its inverse square root is given by: \[ (\text{diag }(K_{xx}))^{-\frac{1}{2}} = \begin{bmatrix} \frac{1}{\sqrt{\text{Var }[X_1]}} & 0 & \cdots & 0 \\ 0 & \frac{1}{\sqrt{\text{Var }[X_2]}} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \frac{1}{\sqrt{\text{Var }[X_n]}} \end{bmatrix}. \] To standardize the auto-covariance matrix \(K_{xx}\) into the correlation matrix \(R\), each covariance entry \(\text{Cov }[X_i, X_j]\) is divided by the product of the standard deviations \(\sqrt{\text{Var }[X_i]} \cdot \sqrt{\text{Var }[X_j]}\). This is achieved by: \[ R = (\text{diag }(K_{xx}))^{-\frac{1}{2}}K_{xx}(\text{diag }(K_{xx}))^{-\frac{1}{2}} \] This formulation normalizes all rows and columns simultaneously, dividing each covariance by the corresponding standard deviations, and avoids computing each pairwise correlation explicitly.
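A vectorized sketch of this standardization, under the same assumed test data as before, shows it matches the entrywise construction:

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(500, 3)) @ rng.normal(size=(3, 3))  # correlated columns

K = np.cov(X, rowvar=False)                       # auto-covariance matrix K_xx
D_inv_sqrt = np.diag(1.0 / np.sqrt(np.diag(K)))   # (diag(K_xx))^{-1/2}
R = D_inv_sqrt @ K @ D_inv_sqrt                   # R = D^{-1/2} K_xx D^{-1/2}

assert np.allclose(np.diag(R), 1.0)                  # unit diagonal
assert np.allclose(R, np.corrcoef(X, rowvar=False))  # matches NumPy directly
```

The two diagonal multiplications rescale every row and every column at once, which is exactly why no explicit double loop over variable pairs is needed.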