Stochastic Matrix

Stochastic Matrix

A probability vector \(x \in \mathbb{R}^n\) is a vector with nonnegative entries that add up to 1, and a stochastic matrix (or transition matrix) \(P \in \mathbb{R}^{n \times n}\) is a square matrix whose columns are probability vectors.

Note: On this page we use column-stochastic matrices, but row-stochastic matrices work just as well. The choice between the two is a matter of convention and convenience, and the formulations are equivalent up to a transpose. Row-stochastic matrices are often natural for Markov chains because each row directly lists the probabilities of transitioning from a given state to all other states. In a linear algebra context, because of the form \(Ax = b\), the column-stochastic convention is more common. (Also, many programming environments default to column vectors.)

A Markov chain is a sequence of probability vectors \(x_0, x_1, x_2, \cdots \) together with a stochastic matrix \(P\) such that \[ x_1 = P x_0, \quad x_2 = P x_1, \quad x_3 = P x_2, \quad \cdots. \] So the Markov chain is described by the first-order difference equation: \[ x_{k+1} = P x_k \quad \text{for } k = 0, 1, 2, \cdots. \] Here, \(x_k\) is called a state vector, and applying \(P\) repeatedly gives: \[ x_k = P^k x_0 \quad \text{for } k = 0, 1, 2, \cdots. \]

Example: Consider the following two states:
  • State 1: a student is sick
  • State 2: a student is not sick
We observed an initial state distribution: \[ x_0 = \begin{bmatrix} 0.1 \\ 0.9 \end{bmatrix} \] which means that, right now, 10% of students are sick and 90% are not.

Moreover, we assume the following conditions:
  • 70% of sick students recover the next day and 30% remain sick.
  • 5% of not-sick students become sick the next day and 95% remain not sick.
So, our stochastic matrix can be written as \[ P = \begin{bmatrix} 0.3 & 0.05 \\ 0.7 & 0.95 \end{bmatrix}. \] Then, \[ x_1 = P x_0 = \begin{bmatrix} 0.3 & 0.05 \\ 0.7 & 0.95 \end{bmatrix} \begin{bmatrix} 0.1 \\ 0.9 \end{bmatrix} = \begin{bmatrix} 0.075 \\ 0.925 \end{bmatrix}. \] This means that on the next day, approximately 7.5% of students are expected to be sick and 92.5% are not.

We can continue this process: \[ \begin{align*} &x_2 = P x_1 = \begin{bmatrix} 0.3 & 0.05 \\ 0.7 & 0.95 \end{bmatrix} \begin{bmatrix} 0.075 \\ 0.925 \end{bmatrix} = \begin{bmatrix} 0.06875 \\ 0.93125 \end{bmatrix} \\\\ &x_3 = P x_2 = \begin{bmatrix} 0.3 & 0.05 \\ 0.7 & 0.95 \end{bmatrix} \begin{bmatrix} 0.06875 \\ 0.93125 \end{bmatrix} = \begin{bmatrix} 0.0671875 \\ 0.9328125 \end{bmatrix} \end{align*} \] and so on.
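As a quick numerical check, here is a minimal sketch of this iteration, assuming Python with NumPy; the matrix and initial vector are exactly the ones above.

import numpy as np

# Column-stochastic transition matrix: column j holds the probabilities
# of moving out of state j (state 1 = sick, state 2 = not sick).
P = np.array([[0.30, 0.05],
              [0.70, 0.95]])
x = np.array([0.1, 0.9])   # initial distribution x_0

for k in range(1, 4):
    x = P @ x              # x_k = P x_{k-1}
    print(f"x_{k} = {x}")
# x_1 = [0.075 0.925]
# x_2 = [0.06875 0.93125]
# x_3 = [0.0671875 0.9328125]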

Steady-State Vector

A steady-state vector \(q\) for a stochastic matrix \(P\) is a probability vector satisfying \[ P q = q. \] In statistics, it is also called a stationary distribution.

If \(P\) is a regular stochastic matrix (meaning some power \(P^k\) has strictly positive entries), the Markov chain \(\{x_k : k = 0, 1, 2, \cdots\}\) converges to the unique steady-state vector \(q\) as \(k \to \infty\). Importantly, the initial state does not affect the long-term behavior of the Markov chain.

In short, this happens because the steady-state vector \(q\) is an eigenvector of the stochastic matrix corresponding to the largest eigenvalue \(\lambda_1 = 1\). Let's dig a little deeper.
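This observation also gives one way to compute \(q\) numerically. A small sketch, again assuming NumPy: take an eigenvector of \(P\) for the eigenvalue 1 and rescale it so its entries sum to 1.

import numpy as np

P = np.array([[0.30, 0.05],
              [0.70, 0.95]])

# Eigen-decomposition; pick the eigenvector whose eigenvalue is closest to 1.
eigvals, eigvecs = np.linalg.eig(P)
i = np.argmin(np.abs(eigvals - 1.0))
q = eigvecs[:, i].real
q = q / q.sum()            # rescale into a probability vector
print(q)                   # [0.06666667 0.93333333]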

For any matrix \(A\), the spectral radius \(\rho(A)\) is defined as the maximum absolute value of its eigenvalues. It satisfies \[ \rho(A) \leq \| A \| \] for any submultiplicative matrix norm \(\| \cdot \| \).

Remember, by definition, every column of a column-stochastic matrix sums to 1, which means the maximum absolute column sum (the 1-norm, \(\| P \|_1\)) is always 1. Thus: \[ \rho(P) \leq \| P \|_1 = 1. \] (For a row-stochastic matrix, we use the infinity norm \(\| P \|_{\infty} = 1\).)
Moreover, 1 is always an eigenvalue of \(P\): every column of \(P\) sums to 1, so \(P^{\top} \mathbf{1} = \mathbf{1}\), and \(P\) and \(P^{\top}\) have the same eigenvalues. Hence the spectral radius is exactly 1, and no eigenvalue can have absolute value greater than 1. For a regular stochastic matrix, every other eigenvalue satisfies \(|\lambda| < 1\), so as \(k \to \infty\), the contribution from all components other than the one corresponding to \(\lambda_1 = 1\) decays exponentially fast. Therefore, any initial distribution converges to the stationary distribution.
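These bounds are easy to check numerically; the following sketch (assuming NumPy) confirms \(\rho(P) = \| P \|_1 = 1\) for our example matrix.

import numpy as np

P = np.array([[0.30, 0.05],
              [0.70, 0.95]])

spectral_radius = np.max(np.abs(np.linalg.eigvals(P)))
one_norm = np.linalg.norm(P, 1)    # maximum absolute column sum
print(spectral_radius, one_norm)   # 1.0 1.0 (up to floating-point rounding)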

Example: In our example, \[ \begin{align*} & P q = q \\\\ &\begin{bmatrix} 0.3 & 0.05 \\ 0.7 & 0.95 \end{bmatrix} \begin{bmatrix} q_1 \\ q_2 \end{bmatrix} = \begin{bmatrix} q_1 \\ q_2 \end{bmatrix}\\\\ & q_1 + q_2 = 1. \end{align*} \] The first row gives \(0.3 q_1 + 0.05 q_2 = q_1\), i.e. \(q_2 = 14 q_1\); combined with \(q_1 + q_2 = 1\), this yields \[ q = \begin{bmatrix} \frac{1}{15} \\ \frac{14}{15} \end{bmatrix}, \] which means that in the long run, about 6.67% of the students will be sick and about 93.33% will not be.

Find eigenvalues and corresponding eigenvectors: \[ \begin{align*} &\det(P - \lambda I) = 0 \\\\ &\Longrightarrow \lambda^2 - 1.25\lambda + 0.25 = 0 \\\\ &\Longrightarrow (\lambda -1)(\lambda -0.25) = 0 \\\\ &\Longrightarrow \lambda_1 = 1, \quad \lambda_2 = 0.25. \end{align*} \] For \(\lambda_1 = 1\), solving \((P -I)v_1 = 0\), we obtain the corresponding eigenvector: \[ v_1 = \begin{bmatrix} 1 \\ 14 \end{bmatrix}. \] (Scaling by \(\frac{1}{15}\), we can get the stationary distribution \(q = \frac{1}{15}v_1\)).

For \(\lambda_2 = 0.25\), solving \((P - 0.25 I)v_2 = 0\), we obtain the corresponding eigenvector: \[ v_2 = \begin{bmatrix} - 1 \\ 1 \end{bmatrix}. \] So, \[ V = \begin{bmatrix} 1 & - 1 \\ 14 & 1 \end{bmatrix}, \quad V^{-1} = \frac{1}{15} \begin{bmatrix} 1 & 1 \\ -14 & 1 \end{bmatrix}, \] and \[ D = \begin{bmatrix} \lambda_1 & 0\\ 0 & \lambda_2 \end{bmatrix} = \begin{bmatrix} 1 & 0 \\ 0 & 0.25 \end{bmatrix}. \] Thus, the transition matrix after \(k\) steps can be written as:
\[ \begin{align*} P^k &= V D^k V^{-1} \\\\ &= \begin{bmatrix} 1 & - 1 \\ 14 & 1 \end{bmatrix} \begin{bmatrix} 1^k & 0 \\ 0 & (0.25)^k \end{bmatrix} \frac{1}{15}\begin{bmatrix} 1 & 1 \\ -14 & 1 \end{bmatrix}. \end{align*} \]
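The following sketch (assuming NumPy) rebuilds \(P^k\) from this decomposition and shows its columns approaching the stationary distribution as \(k\) grows.

import numpy as np

V = np.array([[1.0, -1.0],
              [14.0, 1.0]])
D = np.diag([1.0, 0.25])
V_inv = np.linalg.inv(V)   # equals (1/15) [[1, 1], [-14, 1]]

for k in (1, 5, 20):
    P_k = V @ np.linalg.matrix_power(D, k) @ V_inv
    print(k)
    print(P_k)
# As k grows, every column of P^k approaches q = (1/15, 14/15) ≈ (0.0667, 0.9333).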
The error in the state distribution after \(k\) steps is dominated by the term \(\lambda_2^k = (0.25)^k\). Thus, the convergence rate of the Markov chain toward the stationary distribution is exponential, with each additional step reducing the error roughly by a factor of \(0.25\):
\[ e_k \approx e_0 (0.25)^k \] where \(e_0\) is the initial error.
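A small sketch (assuming NumPy) that tracks this error confirms the factor of \(0.25\) per step:

import numpy as np

P = np.array([[0.30, 0.05],
              [0.70, 0.95]])
q = np.array([1.0, 14.0]) / 15.0   # stationary distribution
x = np.array([0.1, 0.9])           # initial distribution x_0

prev = np.linalg.norm(x - q)
for k in range(1, 6):
    x = P @ x
    err = np.linalg.norm(x - q)
    print(k, err, err / prev)      # each ratio is exactly 0.25 here,
    prev = err                     # because x_0 - q lies along v_2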
In this example, \(P\) is diagonalizable, but in general, stochastic matrices are not always diagonalizable.

A matrix \(A\) is said to be doubly stochastic if every row sum and every column sum equal 1. Such a matrix is not always diagonalizable either, but in the case \(n = 2\), it is necessarily symmetric and thus diagonalizable. Consider the following doubly stochastic matrix for \(0 < t \leq 1\) (so the entries are nonnegative and the eigenvalues are distinct): \[ A = \begin{bmatrix} 1 - t & t \\ t & 1 -t \end{bmatrix}. \] Because each row of \(A\) sums to 1, we have \(A v_1 = v_1\) for \(v_1 = \begin{bmatrix} 1 \\ 1 \end{bmatrix}\), so \(\lambda_1 = 1\) is an eigenvalue. Since the trace of \(A\) is \(2 - 2t\) and equals the sum of the eigenvalues, \[ \lambda_1 + \lambda_2 = 1 + \lambda_2 = 2 - 2t. \] Thus, \(\lambda_2 = 1 - 2t\). Moreover, since the matrix is symmetric and the two eigenvalues are distinct, their corresponding eigenvectors must be orthogonal to each other. Therefore, we have \(v_2 = \begin{bmatrix} 1 \\ -1 \end{bmatrix}\) and \[ A = \begin{bmatrix} 1 & 1 \\ 1 & -1 \end{bmatrix} \begin{bmatrix} 1 & 0 \\ 0 & 1-2t \end{bmatrix} \frac{1}{2}\begin{bmatrix} 1 & 1 \\ 1 & -1 \end{bmatrix}. \]
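As a check, this sketch (assuming NumPy) verifies the decomposition for a sample value \(t = 0.3\):

import numpy as np

t = 0.3
A = np.array([[1 - t, t],
              [t, 1 - t]])

V = np.array([[1.0, 1.0],
              [1.0, -1.0]])
D = np.diag([1.0, 1 - 2 * t])
print(np.allclose(A, V @ D @ np.linalg.inv(V)))   # True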
