Correlation


Cross-Covariance

Often we work with multiple datasets, or different sets of variables, that are related but not directly comparable in structure. In such cases, the cross-covariance measures the covariance between pairs of variables drawn from two different datasets.

For example, consider two different datasets \(A \in \mathbb{R}^{m \times n_1}\) and \(B \in \mathbb{R}^{m \times n_2}\). We can compute the sample cross-covariance matrix: \[ K_{AB} = \frac{1}{m-1}(A-\bar{A})^T(B-\bar{B}) \] (Both \(A\) and \(B\) must have the same number of data points.)
Note: \(K_{AA}\) is called the auto-covariance matrix, which is the same as the covariance matrix that we discussed in Part 5.
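To make the formula concrete, here is a minimal NumPy sketch (the helper name `cross_covariance` and the random test data are our own illustrative choices, not from the original):

```python
import numpy as np

def cross_covariance(A, B):
    """Sample cross-covariance matrix K_AB of two datasets sharing m rows."""
    if A.shape[0] != B.shape[0]:
        raise ValueError("A and B must have the same number of data points")
    m = A.shape[0]
    A_c = A - A.mean(axis=0)  # center columns: A - A_bar
    B_c = B - B.mean(axis=0)  # center columns: B - B_bar
    return A_c.T @ B_c / (m - 1)

rng = np.random.default_rng(0)
A = rng.normal(size=(100, 3))   # m = 100 shared observations, n1 = 3
B = rng.normal(size=(100, 2))   # same m, n2 = 2

K_AB = cross_covariance(A, B)   # shape (3, 2)
K_AA = cross_covariance(A, A)   # auto-covariance matrix
assert np.allclose(K_AA, np.cov(A, rowvar=False))  # matches NumPy's covariance
```

As the final assertion shows, `cross_covariance(A, A)` reproduces the ordinary covariance matrix, consistent with the note above.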

Correlation

The magnitude of the covariance between two random variables \(X\) and \(Y\) depends on the units of the variables, so we standardize the covariance by the standard deviations of the variables: \[ \sigma_X = \sqrt{\mathbb{E}[(X - \mu_X)^2]} \quad \text{and} \quad \sigma_Y = \sqrt{\mathbb{E}[(Y - \mu_Y)^2]}. \] We then obtain the population correlation coefficient between \(X\) and \(Y\), which quantifies the strength and direction of their linear relationship \(Y = aX + b\): \[ \begin{align*} \rho_{X, Y} = \text{Corr }[X, Y] &= \frac{\text{Cov }[X, Y]}{\sigma_X \sigma_Y} \\\\ &= \frac{\mathbb{E}[(X - \mu_X)(Y-\mu_Y)]}{\sigma_X \sigma_Y}. \end{align*} \] Similarly, the sample correlation coefficient is defined as: \[ r_{xy} = \frac{1}{(n-1)s_x s_y}\sum_{i =1}^n (x_i - \bar{x})(y_i - \bar{y}) \] where \(\bar{x},\, \bar{y}\) are the sample means and \(s_x,\, s_y\) are the corrected sample standard deviations.
Note: the corrected sample standard deviation is an unbiased estimator of the population standard deviation: \[ s_x = \sqrt{\frac{1}{n-1}\sum_{i=1}^n (x_i - \bar{x})^2}. \]
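As a quick sanity check, a direct implementation of \(r_{xy}\) agrees with NumPy's built-in estimate (the function name `sample_correlation` and the synthetic data below are our own choices for illustration):

```python
import numpy as np

def sample_correlation(x, y):
    """Sample correlation r_xy built from corrected (n - 1) standard deviations."""
    n = len(x)
    x_bar, y_bar = x.mean(), y.mean()
    s_x = np.sqrt(np.sum((x - x_bar) ** 2) / (n - 1))  # corrected sample std
    s_y = np.sqrt(np.sum((y - y_bar) ** 2) / (n - 1))
    return np.sum((x - x_bar) * (y - y_bar)) / ((n - 1) * s_x * s_y)

rng = np.random.default_rng(1)
x = rng.normal(size=200)
y = 2.0 * x + rng.normal(scale=0.5, size=200)  # noisy positive linear trend

r = sample_correlation(x, y)
assert np.isclose(r, np.corrcoef(x, y)[0, 1])  # agrees with NumPy's estimate
print(r)  # close to 1: strong positive linear relationship
```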

Theorem 1: Boundedness of Correlation Coefficient \[ -1 \leq \rho \leq 1 \] Note:
\(\rho = 1\) indicates a perfect positive linear relationship.
\(\rho = -1\) indicates a perfect negative linear relationship.
\(\rho = 0\) indicates no linear relationship.
Proof: We use the Cauchy-Schwarz inequality for random variables: \[ \mathbb{E}[XY]^2 \leq \mathbb{E}[X^2]\, \mathbb{E}[Y^2] \] Note: \(\langle X, Y \rangle = \mathbb{E}[XY]\) defines an inner product on the space of square-integrable random variables, which is why the Cauchy-Schwarz inequality applies.

Substitute the standardized variables \(\frac{X - \mathbb{E}[X]}{\sigma_X}\) and \(\frac{Y - \mathbb{E}[Y]}{\sigma_Y}\). The left-hand side becomes \(\rho^2\), while each expectation on the right is the variance of a standardized variable, which equals \(1\): \[ \begin{align*} & \mathbb{E}\left[\frac{(X - \mathbb{E}[X])}{\sigma_X}\cdot \frac{(Y - \mathbb{E}[Y])}{\sigma_Y}\right]^2 \leq \mathbb{E}\left[\left(\frac{X - \mathbb{E}[X]}{\sigma_X}\right)^2\right] \, \mathbb{E}\left[\left(\frac{Y - \mathbb{E}[Y]}{\sigma_Y}\right)^2\right] \\\\ &\Longrightarrow \rho^2 \leq 1 \cdot 1 \\\\ &\Longrightarrow -1 \leq \rho \leq 1. \qquad \blacksquare \end{align*} \]
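The bound is also easy to verify empirically; here is a small sketch (the seeds and sample sizes are arbitrary choices on our part):

```python
import numpy as np

# Empirical check of Theorem 1: sample correlations always land in [-1, 1],
# and an exact linear relationship y = a*x + b attains the endpoints.
rng = np.random.default_rng(2)
for _ in range(1_000):
    x = rng.normal(size=50)
    y = rng.normal(size=50)
    assert -1.0 <= np.corrcoef(x, y)[0, 1] <= 1.0

x = rng.normal(size=50)
print(np.corrcoef(x, 3.0 * x + 1.0)[0, 1])   # 1 up to float error (a > 0)
print(np.corrcoef(x, -3.0 * x + 1.0)[0, 1])  # -1 up to float error (a < 0)
```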

The correlation matrix is essentially a standardized version of the covariance matrix, with every entry normalized to a value between \(-1\) and \(1\). This standardization is particularly useful in applications like Principal Component Analysis (PCA), where features with significantly different scales can lead to biased results. By standardizing, we ensure that all features are given fair weighting.

For a random vector \(x \in \mathbb{R}^n\), the correlation matrix is defined as: \[ \begin{align*} R &= \text{Corr }[x] \\\\ &= \begin{bmatrix} 1 & \frac{\mathbb{E}[(X_1-\mu_1)(X_2-\mu_2)]}{\sigma_1 \sigma_2} & \cdots & \frac{\mathbb{E}[(X_1-\mu_1)(X_n-\mu_n)]}{\sigma_1 \sigma_n} \\ \frac{\mathbb{E}[(X_2-\mu_2)(X_1-\mu_1)]}{\sigma_2 \sigma_1} & 1 & \cdots & \frac{\mathbb{E}[(X_2-\mu_2)(X_n-\mu_n)]}{\sigma_2 \sigma_n} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\mathbb{E}[(X_n-\mu_n)(X_1-\mu_1)]}{\sigma_n \sigma_1} & \frac{\mathbb{E}[(X_n-\mu_n)(X_2-\mu_2)]}{\sigma_n \sigma_2} & \cdots & 1 \end{bmatrix} \\\\ &= \begin{bmatrix} 1 & \text{Corr }[X_1, X_2] & \cdots & \text{Corr }[X_1, X_n] \\ \text{Corr }[X_2, X_1] & 1 & \cdots & \text{Corr }[X_2, X_n] \\ \vdots & \vdots & \ddots & \vdots \\ \text{Corr }[X_n, X_1] & \text{Corr }[X_n, X_2] & \cdots & 1 \end{bmatrix} \\\\ \end{align*} \] where \(\mu_i = \mathbb{E}[X_i]\) is the mean and \(\sigma_i = \sqrt{\text{Var }[X_i]}\) is the standard deviation of \(X_i\).
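The definition can be realized entry by entry; a minimal sketch (the data matrix `X` and its dimensions are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(500, 3)) @ rng.normal(size=(3, 3))  # correlated columns

# Entrywise construction: R[i, j] = Corr[X_i, X_j].
n = X.shape[1]
R = np.empty((n, n))
for i in range(n):
    for j in range(n):
        R[i, j] = np.corrcoef(X[:, i], X[:, j])[0, 1]

assert np.allclose(np.diag(R), 1.0)  # Corr[X_i, X_i] = 1 on the diagonal
assert np.allclose(R, R.T)           # symmetric, like the covariance matrix
```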

The correlation matrix \(R\) can be derived from the auto-covariance matrix \(K_{xx}\) alone: \[ \begin{align*} K_{xx} &= \mathbb{E}[(x - \mathbb{E}[x])(x - \mathbb{E}[x])^T] \\\\ &= \begin{bmatrix} \text{Var }[X_1] & \text{Cov }[X_1, X_2] & \cdots & \text{Cov }[X_1, X_n] \\ \text{Cov }[X_2, X_1] & \text{Var }[X_2] & \cdots & \text{Cov }[X_2, X_n] \\ \vdots & \vdots & \ddots & \vdots \\ \text{Cov }[X_n, X_1] & \text{Cov }[X_n, X_2] & \cdots & \text{Var }[X_n] \end{bmatrix}. \end{align*} \] Here, \(\text{diag }(K_{xx})\) denotes the diagonal matrix whose entries are the variances of the individual variables, and its inverse square root is given by: \[ (\text{diag }(K_{xx}))^{-\frac{1}{2}} = \begin{bmatrix} \frac{1}{\sqrt{\text{Var }[X_1]}} & 0 & \cdots & 0 \\ 0 & \frac{1}{\sqrt{\text{Var }[X_2]}} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \frac{1}{\sqrt{\text{Var }[X_n]}} \end{bmatrix}. \] To standardize the auto-covariance matrix \(K_{xx}\) into the correlation matrix \(R\), each covariance entry \(\text{Cov }[X_i, X_j]\) is divided by the product of the standard deviations \(\sqrt{\text{Var }[X_i]} \cdot \sqrt{\text{Var }[X_j]}\). This is achieved by: \[ R = (\text{diag }(K_{xx}))^{-\frac{1}{2}}K_{xx}(\text{diag }(K_{xx}))^{-\frac{1}{2}} \] This formulation normalizes all rows and columns simultaneously, dividing each covariance by the corresponding standard deviations, and avoids computing each pairwise correlation explicitly.
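A vectorized sketch of this standardization, under the same assumed test data as before, shows it matches the entrywise construction:

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(500, 3)) @ rng.normal(size=(3, 3))  # correlated columns

K = np.cov(X, rowvar=False)                       # auto-covariance matrix K_xx
D_inv_sqrt = np.diag(1.0 / np.sqrt(np.diag(K)))   # (diag(K_xx))^{-1/2}
R = D_inv_sqrt @ K @ D_inv_sqrt                   # R = D^{-1/2} K_xx D^{-1/2}

assert np.allclose(np.diag(R), 1.0)                  # unit diagonal
assert np.allclose(R, np.corrcoef(X, rowvar=False))  # matches NumPy directly
```

The two diagonal multiplications rescale every row and every column at once, which is exactly why no explicit double loop over variable pairs is needed.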