Markov Chains

Contents:
- Probabilistic Graphical Models
- Language Models
- Parameter Estimation of Markov Models
- A Code Sample (MLE in Markov Models)
- Sparse Data & Dirichlet Prior

Probabilistic Graphical Models

A graphical model is a graph-based representation of a set of variables and their relationships (typically dependencies or interactions). The graph helps us visualize and reason about the structure of a system composed of many interrelated parts. Here, we focus on a specific kind of graphical model, known as a probabilistic graphical model (PGM), whose vertices represent random variables and whose edges represent probabilistic dependencies among those variables. When the graph is a directed acyclic graph (DAG), the model is often called a Bayesian network. An important assumption is that each vertex (random variable) \(X_i\) is conditionally independent of its non-descendants given its parents. Thus, the joint distribution factorizes as: \[ P(X_1, X_2, \cdots, X_n) = \prod_{i = 1}^n P(X_i | \text{Parents}(X_i)), \] where \(n\) is the number of random variables (vertices) in the model (graph). The vertices are assumed to be in topological order: for every directed edge \(u \to v\), vertex \(u\) comes before vertex \(v\). Each term \(P(X_i | \text{Parents}(X_i))\) is the conditional probability distribution for vertex \(i\).
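To make the factorization concrete, here is a minimal Python sketch of a three-node chain \(X_1 \to X_2 \to X_3\) with binary variables. All CPT values are made up for illustration; the point is that the joint probability of any assignment is a product of CPT entries, and that these products sum to 1 over all assignments.

    import numpy as np

    # Factorization P(X1, X2, X3) = P(X1) P(X2 | X1) P(X3 | X2)
    # for a chain X1 -> X2 -> X3 of binary variables.
    # The CPT entries below are made up for illustration.
    p_x1 = np.array([0.6, 0.4])            # P(X1)
    p_x2_given_x1 = np.array([[0.7, 0.3],  # P(X2 | X1 = 0)
                              [0.2, 0.8]]) # P(X2 | X1 = 1)
    p_x3_given_x2 = np.array([[0.9, 0.1],  # P(X3 | X2 = 0)
                              [0.5, 0.5]]) # P(X3 | X2 = 1)

    def joint(x1, x2, x3):
        """P(X1 = x1, X2 = x2, X3 = x3) as a product of CPT entries."""
        return p_x1[x1] * p_x2_given_x1[x1, x2] * p_x3_given_x2[x2, x3]

    print(joint(0, 1, 0))  # probability of one assignment
    # The joint distribution sums to 1 over all 2^3 assignments.
    total = sum(joint(a, b, c) for a in (0, 1) for b in (0, 1) for c in (0, 1))
    print(total)  # ≈ 1.0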

Language Models

For modeling sequential data, we introduce a linear-chain Bayesian network with a particular type of conditional independence. This is called a Markov chain. The concept of a Markov chain is fundamental in various fields; while the core idea remains the same, its applications and methods of analysis differ depending on the field. Here, we'll explore a specific application in the context of language modeling.

Suppose our goal is to model a joint probability distribution over variable-length sequences: \(P(y_{1:T})\), where each \(y_t \in \{1, \cdots, K\}\) represents a word from a vocabulary of size \(K\). A language model assigns probabilities to possible sentences (sequences) of length \(T\).

By the chain rule of probability, \[ \begin{align*} P(y_{1:T}) &= P(y_1)P(y_2 | y_1)P(y_3 | y_2, y_1)\cdots \\\\ &= \prod_{t = 1}^{T} P(y_t | y_{1 : t-1}). \end{align*} \] However, as \(T\) increases, this formulation becomes computationally expensive due to the need to condition on the entire history. To address this, we make the Markov assumption: the next state depends "only" on the current state.

Under the first-order Markov assumption, \[ \begin{align*} P(y_{1:T}) &= P(y_1)P(y_2 | y_1)P(y_3 | y_2)P(y_4 | y_3)\cdots \\\\ &= P(y_1)\prod_{t = 2}^{T} P(y_t | y_{t-1}). \end{align*} \] This memoryless property simplifies both analysis and computation of probabilities. The function \(P(y_t | y_{t-1})\) is called the transition function or transition kernel, and it satisfies: \[ P(y_t = k | y_{t-1} = j) \geq 0 \quad \text{and} \quad \sum_{k=1}^K P(y_t = k | y_{t-1} = j) = 1. \]

We can represent the transition probabilities using a stochastic matrix: \[ A_{jk} = P(y_t = k | y_{t-1} = j), \] where each row of \(A\) sums to 1. This matrix serves as the conditional probability table (CPT). Since we assume the same transition probabilities at all time steps, the model is said to be time-invariant.
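As a quick illustration, the sketch below builds a small made-up stochastic matrix over \(K = 3\) states, checks that each row sums to 1, and evaluates the probability of a short sequence under the first-order model (the uniform initial distribution \(\pi\) is an assumption for illustration only).

    import numpy as np

    # A made-up 3-state stochastic matrix: A[j, k] = P(y_t = k | y_{t-1} = j).
    A = np.array([
        [0.8, 0.15, 0.05],
        [0.3, 0.5, 0.2],
        [0.1, 0.3, 0.6],
    ])
    # Each row is a conditional distribution, so rows sum to 1.
    assert np.allclose(A.sum(axis=1), 1.0)

    # Probability of the sequence 0 -> 1 -> 1 -> 2, assuming (hypothetically)
    # a uniform initial distribution pi.
    pi = np.ones(3) / 3
    seq = [0, 1, 1, 2]
    p = pi[seq[0]]
    for prev, cur in zip(seq[:-1], seq[1:]):
        p *= A[prev, cur]  # multiply one transition probability per step
    print(p)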

The Markov assumption can be extended to condition on the last \(M\) states (the memory length): \[ P(y_{1:T}) = P(y_{1 : M}) \prod_{t = M + 1}^T P(y_t | y_{t - M : t - 1}). \] This is known as an \(M\)-th order Markov model. In language modeling, this is equivalent to an \((M+1)\)-gram model. For example, if \(M = 2\), each word depends on the two preceding words, leading to a trigram model: \[ P(y_t | y_{t-1}, y_{t-2}). \]
Any higher-order Markov model can be converted into a first-order Markov model by redefining the state to include the past \(M\) observations. For \(M = 2\), define \(\tilde{y}_t = (y_{t-1}, y_t)\); then: \[ \begin{align*} P(y_{1:T}) = P(\tilde{y}_{2:T}) &= P(\tilde{y}_2) \prod_{t = 3}^T P(\tilde{y}_t | \tilde{y}_{t-1}) \\\\ &= P(y_1, y_2) \prod_{t = 3}^T P(y_t | y_{t-1}, y_{t-2}). \end{align*} \]
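The sketch below illustrates this conversion with made-up numbers: starting from a random trigram CPT \(A_2[i, j, k] = P(y_t = k | y_{t-2} = i, y_{t-1} = j)\), it builds the \(K^2 \times K^2\) first-order transition matrix over composite pair states \((y_{t-1}, y_t)\). A transition \((i, j) \to (j^{\prime}, k)\) is possible only when \(j^{\prime} = j\).

    import itertools
    import numpy as np

    # Fold a second-order (trigram) model into a first-order chain over
    # pair states (y_{t-1}, y_t). The trigram CPT is randomly generated
    # here purely for illustration.
    K = 2
    rng = np.random.default_rng(0)
    A2 = rng.dirichlet(np.ones(K), size=(K, K))  # shape (K, K, K)

    pairs = list(itertools.product(range(K), repeat=2))  # K^2 composite states
    pair_index = {pair: n for n, pair in enumerate(pairs)}

    # Build the K^2 x K^2 first-order transition matrix: the pair (i, j) can
    # only move to a pair of the form (j, k), with probability A2[i, j, k].
    A1 = np.zeros((K * K, K * K))
    for (i, j) in pairs:
        for k in range(K):
            A1[pair_index[(i, j)], pair_index[(j, k)]] = A2[i, j, k]

    assert np.allclose(A1.sum(axis=1), 1.0)  # still a stochastic matrix
    print(A1)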
In practice, with large vocabularies, modeling all possible transitions becomes infeasible, and we need additional techniques such as neural language models to approximate the distribution efficiently. Beyond language modeling, Markov chains are widely used in sequential data modeling in general. Additionally, Markov chain Monte Carlo (MCMC) methods are crucial in Bayesian statistics for performing approximate inference when direct sampling is challenging.

Parameter Estimation of Markov Models

The probability of any particular sequence of length \(T\) is given by \[ \begin{align*} P(x_{1:T} | \boldsymbol\theta) &= \pi (x_1)A(x_1, x_2)\cdots A(x_{T-1}, x_T) \\\\ &= \prod_{j=1}^K (\pi_j)^{\mathbb{I}(x_1 =j)} \prod_{t=2}^T \prod_{j=1}^K \prod_{k=1}^K (A_{jk})^{\mathbb{I}(x_t=k,\, x_{t-1}=j)}, \tag{1} \end{align*} \] where \(\pi_j\) is the probability that the first symbol is \(j\), and \(A_{jk}\) is the probability of going from symbol \(j\) to symbol \(k\); the matrix \(A\) is the transition probability matrix. In addition, \(\mathbb{I}(\cdot)\) is the indicator function. For example, \[ \mathbb{I}(x_1 =j) = \begin{cases} 1 &\text{if \(x_1 = j\)} \\ 0 &\text{otherwise} \end{cases}. \] This lets us convert products over time steps into counts. In Equation 1, exactly one transition happens at each time step, so only one indicator is nonzero and only one factor contributes per step.
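The following snippet verifies, with made-up numbers, that the indicator/count form in Equation 1 agrees with the direct product \(\pi (x_1)A(x_1, x_2)\cdots A(x_{T-1}, x_T)\):

    import numpy as np

    # Made-up parameters over K = 3 symbols and a hypothetical observed sequence.
    pi = np.array([0.5, 0.3, 0.2])
    A = np.array([[0.6, 0.3, 0.1],
                  [0.2, 0.5, 0.3],
                  [0.4, 0.4, 0.2]])
    x = [0, 2, 1, 1, 0]

    # Direct product: pi(x_1) * A(x_1, x_2) * ... * A(x_{T-1}, x_T).
    direct = pi[x[0]] * np.prod([A[j, k] for j, k in zip(x[:-1], x[1:])])

    # Count form: exponents are the transition counts N_jk.
    N_jk = np.zeros((3, 3))
    for j, k in zip(x[:-1], x[1:]):
        N_jk[j, k] += 1
    count_form = pi[x[0]] * np.prod(A ** N_jk)

    assert np.isclose(direct, count_form)
    print(direct, count_form)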

The log-likelihood of a set of sequences \(\mathcal{D} = (x_1, \cdots, x_N)\), where \(x_i = (x_{i\,1}, \cdots, x_{i \, T_{i}})\) is a sequence of length \(T_i\), is given by \[ \begin{align*} \log P(\mathcal{D} | \boldsymbol\theta) &= \sum_{i=1}^N \log P(x_i | \boldsymbol\theta) \\\\ &= \sum_j N_j^1 \log \pi_j + \sum_j \sum_k N_{jk} \log A_{jk}, \end{align*} \] where we define the following counts: \[ N_j^1 = \sum_{i=1}^N \mathbb{I}(x_{i\,1} = j), \qquad N_{jk} = \sum_{i=1}^N \sum_{t=2}^{T_i} \mathbb{I}(x_{i\,t} = k, \, x_{i\,t-1} = j). \]

We want to obtain the MLEs \(\hat{\pi}_j\) and \( \hat{A}_{jk}\) under the constraints: \[ \begin{align*} &\sum_j \pi_j = 1 \\\\ &\sum_k A_{jk} = 1 \text{ for each } j. \end{align*} \] For \(\pi_j\), we introduce a Lagrange multiplier \(\lambda\) and define \[ \mathcal{L}_{\pi} = \sum_j N_j^1 \log \pi_j + \lambda \left( 1 - \sum_j \pi_j \right). \] Then \[ \frac{\partial \mathcal{L}_{\pi}}{\partial \pi_j} = \frac{ N_j^1}{\pi_j} - \lambda = 0 \Longrightarrow \pi_j = \frac{N_j^1}{\lambda}. \] Plugging this into the constraint \(\sum_j \pi_j = 1\), we have \[ \sum_j \frac{N_j^1}{\lambda} = 1 \Longrightarrow \lambda = \sum_j N_j^1. \] Thus, \[ \hat{\pi}_j = \frac{N_j^1}{\sum_{j^{\prime}} {N_{j^{\prime}}^1}}. \] For \(A_{jk}\), we introduce Lagrange multipliers \(\mu_j\) and define \[ \mathcal{L}_{A} = \sum_j \sum_k N_{jk} \log A_{jk} + \sum_j \mu_j \left(1 - \sum_k A_{jk} \right). \] Then \[ \frac{\partial \mathcal{L}_{A}}{\partial A_{jk}} = \frac{ N_{jk}}{A_{jk}} - \mu_j = 0 \Longrightarrow A_{jk} = \frac{N_{jk}}{\mu_j}. \] Plugging this into the constraint \(\sum_k A_{jk} = 1\) for each \(j\), we have \[ \sum_k \frac{N_{jk}}{\mu_j} = 1 \Longrightarrow \mu_j = \sum_k N_{jk} = N_j. \] Thus, \[ \hat{A}_{jk} = \frac{N_{jk}}{N_j}. \]

A Code Sample (MLE in Markov Models)

Let's pretend we observed 14 days of weather in 10 cities (i.e., 10 sequences, each of length 14). We compute MLE estimates for start probabilities \(\pi\) and transition matrix \(A\).
(Note: you could also print the results as pandas DataFrames.)

    import numpy as np
    import random

    # --- Constants ---
    # Define states
    STATES = ['Sunny', 'Rainy', 'Cloudy', 'Stormy', 'Foggy']
    STATE_TO_INDEX = {state: i for i, state in enumerate(STATES)}
    INDEX_TO_STATE = {i: state for i, state in enumerate(STATES)}
    # Observed 14 days of weather in 10 cities (i.e., 10 sequences, each of length 14).
    DAYS = 14
    CITIES = 10

    # Define the true transition matrix (for data generation only)
    TRUE_TRANSITION_MATRIX = np.array([
        [0.5, 0.2, 0.2, 0.05, 0.05],   # From Sunny
        [0.3, 0.4, 0.2, 0.1, 0.0],     # From Rainy
        [0.4, 0.2, 0.3, 0.05, 0.05],   # From Cloudy
        [0.1, 0.4, 0.2, 0.3, 0.0],     # From Stormy
        [0.3, 0.1, 0.4, 0.0, 0.2],     # From Foggy
    ])

    # --- Sequence Generation ---
    # Generate a sequence of weather states
    def generate_sequence(length, transition_matrix, states, state_to_index, start_state=None):
        if start_state is None:
            start_state = random.choice(states)
        seq = [start_state]
        for _ in range(length - 1):
            current_index = state_to_index[seq[-1]]
            next_state = np.random.choice(states, p=transition_matrix[current_index])
            seq.append(next_state)
        return seq

    # --- MLE Estimation ---
    def estimate_mle(sequences, states, state_to_index):
        num_states = len(states)
        N1 = np.zeros(num_states)                  # Start state counts
        N_jk = np.zeros((num_states, num_states))  # Transition counts

        for seq in sequences:
            first_idx = state_to_index[seq[0]]
            N1[first_idx] += 1
            for t in range(len(seq) - 1):
                j = state_to_index[seq[t]]
                k = state_to_index[seq[t + 1]]
                N_jk[j, k] += 1

        pi_hat = N1 / np.sum(N1)  # Estimate start probabilities π̂
        row_sums = np.sum(N_jk, axis=1, keepdims=True)
        # Estimate transition matrix Â; rows with no observed transitions stay zero
        A_hat = np.divide(N_jk, row_sums, out=np.zeros_like(N_jk), where=row_sums != 0)

        return pi_hat, A_hat

    # --- Display Functions ---
    def print_sequences(sequences):
        print("Sequences:")
        for i, seq in enumerate(sequences):
            print(f"City {i+1}:\n {' → '.join(seq)}")

    def print_start_probabilities(pi_hat, states):
        print("\nEstimated Start Probabilities (π̂ ):")
        for state, prob in zip(states, pi_hat):
            print(f"{state:>6}: {prob:.3f}")

    def print_transition_matrix(A_hat, states):
        # Determine column width based on the longest state name + padding
        max_len = max(len(state) for state in states)
        col_width = max_len + 2

        # Create header dynamically with calculated column width
        header = "From \\ To".rjust(col_width) + " | " + " | ".join(f"{s:^{col_width}}" for s in states)
        print(header)
        print("-" * len(header))
        # Print each row, formatting numbers with dynamic width
        for j, row in enumerate(A_hat):
            row_str = " | ".join(f"{p:{col_width}.3f}" for p in row)
            print(f"{states[j]:>{col_width}} | {row_str}")

    if __name__ == "__main__":
        sequences = [generate_sequence(DAYS, TRUE_TRANSITION_MATRIX, STATES, STATE_TO_INDEX) for _ in range(CITIES)]
        pi_hat, A_hat = estimate_mle(sequences, STATES, STATE_TO_INDEX)
        print_sequences(sequences)
        print_start_probabilities(pi_hat, STATES)
        print("\nEstimated Transition Matrix (Â):")
        print_transition_matrix(A_hat, STATES)
        print("\nTrue Transition Matrix (A):")
        print_transition_matrix(TRUE_TRANSITION_MATRIX, STATES)

Sparse Data & Dirichlet Prior

When we have a limited number of sequences (or sequence steps), many possible transitions may never be observed at all, or may be observed only once or twice. This leaves us with few observations relative to the number of parameters (transitions). This sparse data problem is a key issue in Markov chain models, especially when working with many possible states. Sparse data lead to unreliable MLE estimates (overfitting, zero estimates, biased predictions, inaccurate long-term behavior). To address these issues, we often need smoothing or Bayesian approaches that generalize better. However, we should only smooth where appropriate: smoothing must not violate known structure (such as transitions that are truly impossible).

A simple smoothing technique is add-one smoothing, which avoids zero estimates when some transitions are never observed. In the Bayesian framework, add-one smoothing is equivalent to using a Dirichlet prior with \(\alpha_j = 1, \, \forall j\).

Uniform Dirichlet Prior: The \(j\)-th row of the transition matrix \(A\) (which represents the probabilities of transitioning from state \(j\) to each of the \(K\) possible states) follows a Dirichlet distribution with parameter \(\alpha\): \[ A_{j:} \sim \text{ Dir }(\alpha \: \boldsymbol{1}). \] In this case, the smoothed estimate (the posterior mean; its relation to the MAP estimate, i.e., the mode of the posterior distribution, is discussed below) becomes \[ \hat{A}_{jk} = \frac{N_{jk}+\alpha}{N_j + K \alpha}. \] If \(\alpha = 1\), this is called add-one smoothing. Note that every state is treated "equally" in the prior (\(\alpha_1 = \alpha_2 = \cdots = \alpha_K\)).
Assume that for the \(j\)-th row, we have observed counts \(N_{jk}\) for transitioning to state \(k\). The total number of transitions from state \(j\) is \[ N_j = \sum_{k=1}^K N_{jk}. \] The likelihood under a multinomial model is: \[ P(\{N_{jk}\} | A_{j:}) \propto \prod_{k=1}^K (A_{jk})^{N_{jk}}. \] Now, with the uniform Dirichlet prior, we have: \[ P(A_{j:}) \propto \prod_{k=1}^K (A_{jk})^{\alpha -1}. \] (The prior adds the same pseudo-count to each transition probability.)
By Bayes's rule, the posterior is proportional to the product of the likelihood and the prior: \[ \begin{align*} P(A_{j:} | \{N_{jk}\}) &\propto \left[ \prod_{k=1}^K (A_{jk})^{N_{jk}}\right] \times \left[ \prod_{k=1}^K (A_{jk})^{\alpha-1}\right] \\\\ &= \prod_{k=1}^K (A_{jk})^{N_{jk} + \alpha -1}. \end{align*} \] Thus, the posterior distribution for \(A_{j:}\) is a Dirichlet distribution: \[ A_{j:} | \{N_{jk}\} \sim \text{ Dir } (N_{j1} +\alpha, N_{j2} +\alpha, \cdots, N_{jK} + \alpha). \] There are two common choices for an estimator from a posterior distribution: the posterior mean and the mode (MAP estimate). For a Dirichlet distribution, the posterior mean is given by: \[ \begin{align*} E [A_{jk} | \{N_{jk}\}] &= \frac{N_{jk}+\alpha}{\sum_{i=1}^K (N_{ji}+\alpha)} \\\\ &= \frac{N_{jk}+\alpha}{N_j + K \alpha}. \end{align*} \] This estimator is often used in practice because it is always defined. The mode (the "actual" MAP estimate) is given by \[ \text{mode }(A_{jk}) = \frac{N_{jk}+\alpha - 1}{N_j + K (\alpha - 1)}. \] However, for \(\alpha = 1\) this reduces to the unsmoothed MLE \(N_{jk} / N_j\) (and is undefined when \(N_j = 0\)), so it provides no smoothing. Also, in the large-sample regime, the posterior mean and the mode of the Dirichlet distribution are nearly identical. That is why the estimator based on the posterior mean is commonly used and, for \(\alpha = 1\), is referred to as add-one smoothing.
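In code, the smoothed (posterior-mean) estimator is a one-line change from the MLE: add \(\alpha\) to every transition count before normalizing. The sketch below uses made-up counts; note how the all-zero row, which would leave the MLE undefined, becomes a well-defined uniform distribution.

    import numpy as np

    # Smoothed estimator: A_hat[j, k] = (N_jk + alpha) / (N_j + K * alpha).
    alpha = 1.0  # alpha = 1 corresponds to add-one smoothing

    # Hypothetical transition counts N_jk over K = 3 states.
    N_jk = np.array([
        [8.0, 2.0, 0.0],
        [0.0, 0.0, 0.0],  # state 1 was never left: the MLE row is undefined
        [1.0, 0.0, 4.0],
    ])
    K = N_jk.shape[0]
    N_j = N_jk.sum(axis=1, keepdims=True)

    A_smoothed = (N_jk + alpha) / (N_j + K * alpha)
    print(A_smoothed)  # every entry is strictly positive
    assert np.allclose(A_smoothed.sum(axis=1), 1.0)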

Again, note that this method assumes a uniform prior (each outcome is "equally" likely a priori), which is NOT realistic in complicated models. For more complex, structured problems where capturing group-level variation is important, hierarchical Bayesian methods generally provide a better approach, despite their increased computational cost. Hierarchical Bayesian models share statistical strength across sequences (e.g., cities): assume each sequence (e.g., each city) has its own transition matrix, drawn from a common hyper-distribution, as in the sketch below.
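As a purely generative sketch (the global matrix, concentration parameter, and dimensions below are illustrative assumptions, and fitting such a model is beyond the scope of this post), here is how per-city transition matrices could be drawn around a shared global matrix:

    import numpy as np

    # Each city's transition rows are drawn from a Dirichlet centered on a
    # shared global row; a larger concentration keeps cities closer to it.
    rng = np.random.default_rng(1)
    K, num_cities = 3, 4
    concentration = 50.0

    global_A = rng.dirichlet(np.ones(K), size=K)  # shared K x K stochastic matrix
    city_As = np.stack([
        np.stack([rng.dirichlet(concentration * global_A[j]) for j in range(K)])
        for _ in range(num_cities)
    ])  # shape (num_cities, K, K): statistical strength is shared via global_A
    print(city_As.shape)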