Introduction
Reinforcement Learning (RL) is a machine learning framework in which an agent interacts
with an environment to learn a behavior that maximizes a scalar reward signal over time.
Formally, the environment is typically modeled as a Markov Decision Process (MDP), defined by a tuple
\[
(\mathcal{S}, \mathcal{A}, \mathcal{T}, r, \gamma)
\]
where \(\mathcal{S}\) is the state space, \(\mathcal{A}\) is the action space, \(\mathcal{T}\) is the transition function,
\(r\) is the reward function, and \(\gamma\) is the discount factor.
At each time step, the agent observes a state \(s_t \in \mathcal{S}\), selects an action \(a_t \in \mathcal{A}\), receives a
reward \(r_t\), and transitions to a new state \(s_{t+1}\) according to the transition dynamics \(\mathcal{T}(s_{t+1} \mid s_t, a_t)\).
The goal of the agent is to learn a policy \(\pi(a \mid s)\) that maximizes the expected cumulative reward:
\[
\mathbb{E}_\pi \left[ \sum_{t=0}^\infty \gamma^t r_t \right].
\]
Reinforcement learning differs fundamentally from other machine learning paradigms:
- Supervised learning trains models on ground-truth input-output pairs with explicit correct labels for each input.
- Unsupervised learning discovers hidden structure or representations in unlabeled data without any feedback signal.
- Reinforcement learning uses only a scalar reward signal that provides evaluative feedback based on the agent's interaction with an environment.
In RL, learning is driven by trial-and-error interaction rather than labeled examples. Unlike supervised learning, there are no correct labels for each input,
and unlike unsupervised learning, the objective is not to model data structure but to learn a policy that maximizes expected cumulative reward over time.
RL has been successfully applied in domains such as robotics, recommendation systems, and game-playing.
More recently, RL has also become integral to the training of modern large language models (LLMs).
In particular, a method called Reinforcement Learning from Human Feedback (RLHF) is used to fine-tune LLMs
to follow human preferences, ensuring outputs are more helpful, safe, and better aligned with user intent.
The diversity of RL applications has led to a rich classification of algorithms, each designed to address different aspects of
the learning problem. Understanding this classification is crucial for selecting appropriate methods and comprehending the
relationships between different approaches.
Classification of RL Algorithms
Model-Based vs Model-Free Methods
The most fundamental distinction in RL concerns whether the agent learns an explicit model of the environment.
Model-based RL methods learn an explicit model of the environment from data, including:
- the transition dynamics \(p_T(s' \mid s, a)\), and
- the reward model \(p_R(r \mid s, a, s')\).
Once learned, the agent uses planning algorithms to compute an optimal policy \(\pi_*\).
Classical planning is based on dynamic programming methods such as value iteration (VI) and policy iteration (PI).
The main advantage is sample efficiency - the agent can simulate many trajectories using the learned model without additional
environment interaction. However, model learning introduces model bias: if the learned model is inaccurate, the agent
may perform well in simulation but poorly in the real environment.
Model-free RL methods bypass explicit environment modeling and directly learn value functions or policies
from experience. These methods are generally simpler to implement and more robust to model errors, but typically require
more environment interactions compared to model-based approaches.
Value-Based vs Policy-Based
This taxonomy classifies model-free RL algorithms based on what they learn.
Value-based methods learn the optimal Q-function from experience, and then derive
policies implicitly:
\[
\begin{align*}
\pi_*(s) &= \arg\max_a Q_*(s,a) \\\\
&= \arg\max_a [R(s, a) + \gamma \mathbb{E}_{p_T(s' \mid s, a)}[V_*(s')]]
\end{align*}
\]
Given a transition \((s, a, r, s')\), we define the temporal difference (or TD error) as:
\[
r + \gamma \max_{a'} Q_w (s', a') - Q_w (s, a)
\]
where \(Q_w\) is a function approximator, which is trained iteratively.
Note that the expected TD error equals the Bellman error, and when \(Q_w = Q_*\), the TD error is zero in expectation.
(e.g., Q-learning, SARSA, and Deep Q-Networks (DQN).)
Policy-based methods directly optimize \(J(\pi_{\mathbf{\theta}})\) with respect to the
policy parameters \(\mathbf{\theta}\) using policy gradients, and can naturally handle continuous action spaces
and stochastic policies.
(e.g., REINFORCE, trust region policy optimization (TRPO), and actor-critic methods
such as advantage actor-critic (A2C), deep deterministic policy gradient (DDPG), soft actor-critic (SAC),
and proximal policy optimization (PPO).)
On-Policy vs Off-Policy Methods
This taxonomy concerns the data usage strategy during learning.
On-policy methods learn from data generated by the current policy, that is, the same policy that is being improved.
The agent explores the environment and updates its policy based on the actions it has actually taken.
(e.g., SARSA, REINFORCE, A2C, and PPO.)
Off-policy methods learn from data generated by a different policy than the one being
improved. This allows the agent to reuse old data or learn from data collected by other agents. The policy used to
collect data is called the behavioral policy, and the policy being learned is called the target policy.
They offer higher sample efficiency than on-policy methods, but require handling distribution mismatch.
(e.g., Q-learning, DQN, DDPG, and SAC.)
Exploration vs Exploitation
The exploration-exploitation tradeoff is fundamental to reinforcement learning. An agent must
balance between:
- Exploitation: Choosing actions that yield high rewards based on current knowledge
- Exploration: Trying new actions to potentially discover better strategies
\(\epsilon\)-Greedy Strategy
The simplest exploration strategy chooses:
\[
a_t = \begin{cases}
\arg\max_a Q(s_t, a) & \text{with probability } 1-\epsilon \\
\text{random action} & \text{with probability } \epsilon
\end{cases}
\]
where \(\epsilon \in [0,1]\) controls the exploration rate. Often \(\epsilon\) is decayed over time.
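As a concrete illustration, here is a minimal NumPy sketch of \(\epsilon\)-greedy selection over a tabular Q-function; the array shapes and the fixed \(\epsilon\) are illustrative assumptions:
import numpy as np

def epsilon_greedy(Q, state, epsilon, rng=np.random.default_rng()):
    # With probability epsilon take a uniformly random action (explore);
    # otherwise take the greedy action under the current Q estimate (exploit).
    if rng.random() < epsilon:
        return int(rng.integers(Q.shape[1]))
    return int(np.argmax(Q[state]))

Q = np.zeros((5, 3))                               # e.g., 5 states, 3 actions (illustrative)
action = epsilon_greedy(Q, state=0, epsilon=0.1)   # in practice, epsilon is often decayed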
Softmax Action Selection
A more sophisticated approach uses the Boltzmann (softmax) distribution:
\[
P(a_t = a \mid s_t) = \frac{\exp(Q(s_t, a)/\tau)}{\sum_{a'} \exp(Q(s_t, a')/\tau)}
\]
where \(\tau > 0\) is the temperature parameter. Higher temperatures lead to more exploration.
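A corresponding sketch of Boltzmann action selection over the same kind of tabular Q-function (the temperature schedule is left to the caller):
import numpy as np

def softmax_action(Q, state, tau, rng=np.random.default_rng()):
    # Sample an action in proportion to exp(Q(s, a) / tau).
    prefs = Q[state] / tau
    prefs = prefs - prefs.max()                    # subtract the max for numerical stability
    probs = np.exp(prefs) / np.exp(prefs).sum()
    return int(rng.choice(len(probs), p=probs))
Driving \(\tau\) toward zero recovers greedy selection, while a large \(\tau\) approaches a uniform random policy.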
Optimistic Initialization
Initializing Q-values optimistically (higher than realistic values) encourages exploration
of all actions early in learning, as the agent will be "disappointed" by actual rewards
and try other actions.
The choice of exploration strategy significantly affects learning performance and is often
problem-dependent. More advanced methods like UCB (Upper Confidence Bound) and Thompson
sampling provide principled approaches to this tradeoff.
Markov Decision Process (MDP)
Before diving into the details of RL algorithms, this section presents a detailed probabilistic formulation
of Markov Decision Processes (MDPs), where both transitions and rewards are modeled as stochastic variables.
This perspective extends the classical deterministic view and is particularly useful for analyzing trajectory
distributions and gradient-based learning methods used in modern reinforcement learning.
An agent sequentially interacts with an initially unknown environment to obtain a trajectory
or multiple trajectories. A trajectory of length \(T\) is defined as:
\[
\boldsymbol{\tau} = (s_0, a_0, r_0, s_1, a_1, r_1, s_2, \cdots, s_T),
\]
where \(s_t\) is a state, \(a_t\) is an action, and \(r_t\) is a reward.
The objective is to optimize the agent's action-selection policy so that the expected discounted cumulative reward
is maximized:
\[
G_0 = \sum_{t = 0}^{T -1} \gamma^t r_t,
\]
where \(\gamma \in [0, 1]\) is the discount factor. We assume the environment follows a
Markov Decision Process (MDP), where the trajectory distribution can be factored into
single-step transition and reward models. The process of estimating an optimal policy from trajectories
is referred to as learning.
We define the MDP as the tuple:
\[
\left\langle \mathcal{S}, \mathcal{A}, p_T, p_R, p_0 \right\rangle,
\]
where:
- \(\mathcal{S}\): set of environment states
- \(\mathcal{A}\): set of available actions
- \(p_T(s' \mid s, a)\): transition model (next-state distribution)
- \(p_R(r \mid s, a, s')\): reward model (stochastic reward distribution)
- \(p_0(s_0)\): initial state distribution
At time \(t = 0\), the initial state is sampled as \(s_0 \sim p_0\). At each step \(t \geq 0\), the agent observes
state \(s_t \in \mathcal{S}\), selects action \(a_t \sim \pi(a_t \mid s_t)\), transitions to the next state
\(s_{t+1} \sim p_T(s_{t+1} \mid s_t, a_t)\), and receives reward \(r_t \sim p_R(r \mid s_t, a_t, s_{t+1})\).
The agent's decision-making is governed by a stochastic policy \(\pi(a \mid s)\).
This interaction at each step is called a transition, represented as the tuple:
\[
(s_t, a_t, r_t, s_{t+1}),
\]
where:
- \(a_t \sim \pi(a \mid s_t)\)
- \(s_{t+1} \sim p_T(s_{t+1} \mid s_t, a_t)\)
- \(r_t \sim p_R(r \mid s_t, a_t, s_{t+1})\)
Under policy \(\pi\), the joint distribution of a trajectory \(\boldsymbol{\tau}\) of length \(T\) is given by:
\[
p(\boldsymbol{\tau}) = p_0(s_0) \prod_{t = 0}^{T -1}
\pi(a_t \mid s_t) \, p_T(s_{t+1} \mid s_t, a_t) \, p_R(r_t \mid s_t, a_t, s_{t+1}).
\]
We define the expected reward function from the reward model \(p_R\) as the marginal average immediate reward
for taking action \(a\) in state \(s\), integrating over possible next states:
\[
R(s, a) = \mathbb{E}_{p_T(s' \mid s, a)} \left[
\mathbb{E}_{p_R(r \mid s, a, s')}[r]
\right].
\]
While classical RL literature often defines the reward function deterministically as \(r(s, a)\),
this probabilistic formulation allows us to account for uncertainty in transitions and rewards.
It is particularly useful for gradient-based RL methods, where trajectory likelihoods must be modeled explicitly.
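To make this generative view concrete, the following sketch samples a trajectory from a small tabular MDP under a uniform random policy; the particular arrays p0 and p_T and the Gaussian reward model are illustrative assumptions:
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 3, 2
p0 = np.array([1.0, 0.0, 0.0])                                       # p_0(s_0)
p_T = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))   # p_T(s' | s, a)
reward_mean = rng.normal(size=(n_states, n_actions, n_states))       # mean of p_R(r | s, a, s')

def sample_trajectory(T):
    s = rng.choice(n_states, p=p0)
    traj = []
    for t in range(T):
        a = rng.integers(n_actions)                                  # uniform policy pi(a | s)
        s_next = rng.choice(n_states, p=p_T[s, a])
        r = rng.normal(reward_mean[s, a, s_next], 1.0)               # stochastic reward
        traj.append((s, a, r, s_next))
        s = s_next
    return traj

print(sample_trajectory(T=5))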
Value Functions and Bellman Equations
Let \(\boldsymbol{\tau}\) be a trajectory of length \(T\), where \(T\) may be infinite (\(T = \infty\)) for continuing tasks.
The return at time \(t\) is defined as the total accumulated reward from that point forward, discounted by a factor \(\gamma \in [0, 1]\):
\[
\begin{align*}
G_t &= r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \cdots + \gamma^{T - t - 1} r_{T - 1} \\\\
&= \sum_{k = 0}^{T - t - 1} \gamma^k r_{t + k} \\\\
&= \sum_{j = t}^{T - 1} \gamma^{j - t} r_j.
\end{align*}
\]
The discount factor \(\gamma\) ensures that the return remains finite even for infinite-horizon problems, and it gives higher weight to short-term rewards, thereby encouraging the agent to achieve goals sooner.
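A short sketch of computing \(G_t\) for every time step of a finite trajectory with a single backward pass (the reward list is an assumed input):
import numpy as np

def discounted_returns(rewards, gamma):
    # G_t = sum_k gamma^k r_{t+k}, computed backward in O(T).
    G = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        G[t] = running
    return G

print(discounted_returns([1.0, 0.0, 2.0], gamma=0.9))   # approximately [2.62, 1.8, 2.0]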
Given a stochastic policy \(\pi(a \mid s)\), the state-value function (or simply, value function) is defined as:
\[
\begin{align*}
V_{\pi}(s) &= \mathbb{E}_{\pi} \left[ G_0 \mid s_0 = s \right] \\\\
&= \mathbb{E}_{\pi} \left[ \sum_{t = 0}^{\infty} \gamma^t r_t \mid s_0 = s \right],
\end{align*}
\]
where the expectation is over trajectories induced by the policy \(\pi\).
The action-value function (or Q-function) is defined as:
\[
\begin{align*}
Q_{\pi}(s, a) &= \mathbb{E}_{\pi} \left[ G_0 \mid s_0 = s, \, a_0 = a \right] \\\\
&= \mathbb{E}_{\pi} \left[ \sum_{t = 0}^{\infty} \gamma^t r_t \mid s_0 = s, \, a_0 = a \right].
\end{align*}
\]
The advantage function is defined as:
\[
A_{\pi}(s, a) = Q_{\pi}(s, a) - V_{\pi}(s),
\]
which quantifies how much better it is to take action \(a\) in state \(s\) and then follow policy \(\pi\), compared to just following \(\pi\) from the beginning.
A policy \(\pi_*\) is called an optimal policy if it yields the highest value for every state:
\[
\forall s \in \mathcal{S}, \quad V_{\pi_*}(s) \geq V_{\pi}(s), \quad \forall \pi.
\]
Although multiple optimal policies may exist, their value functions are the same: \(V_*(s) = V_{\pi_*}(s)\) and \(Q_*(s, a) = Q_{\pi_*}(s, a)\).
Moreover, every finite MDP admits at least one deterministic optimal policy.
Bellman's Optimality Equations:
\[
\begin{align*}
V_*(s) &= \max_a \left[ R(s, a) + \gamma \, \mathbb{E}_{p_T(s' \mid s, a)} \left[ V_*(s') \right] \right] \\\\
Q_*(s, a) &= R(s, a) + \gamma \, \mathbb{E}_{p_T(s' \mid s, a)} \left[ \max_{a'} Q_*(s', a') \right]
\end{align*}
\]
The optimal value functions \(V_*\) and \(Q_*\) are the unique fixed points of these equations.
The optimal policy can then be derived by:
\[
\begin{align*}
\pi_*(s) &= \arg \max_a Q_*(s, a) \\\\
&= \arg \max_a \left[ R(s, a) + \gamma \, \mathbb{E}_{p_T(s' \mid s, a)} \left[ V_*(s') \right] \right].
\end{align*}
\]
Solving for \(V_*\), \(Q_*\), or \(\pi_*\) is known as policy optimization, while computing
\(V_{\pi}\) or \(Q_{\pi}\) for a given policy \(\pi\) is called policy evaluation.
Bellman's equations for policy evaluation can be derived from the optimality equations
by replacing each maximization over actions, \(\max_a [\,\cdot\,]\), with an expectation over the policy at the corresponding
state, \(\mathbb{E}_{\pi(a \mid s)}[\,\cdot\,]\), wherever action selection occurs.
Bellman's Expectation Equations:
For a fixed policy \(\pi\), the state-value function satisfies:
\[
V_{\pi}(s) = \mathbb{E}_{\pi(a \mid s)} \left[ R(s, a) + \gamma \, \mathbb{E}_{p_T(s' \mid s, a)} \left[ V_{\pi}(s') \right] \right]
\]
The action-value function satisfies:
\[
Q_{\pi}(s, a) = R(s, a) + \gamma \, \mathbb{E}_{p_T(s' \mid s, a)} \left[ \mathbb{E}_{\pi(a' \mid s')} \left[ Q_{\pi}(s', a') \right] \right]
\]
Dynamic Programming Algorithms for RL
Dynamic programming (DP) is a technique for solving optimization problems by breaking them down into
simpler subproblems and storing the results to avoid redundant computations. In the context of reinforcement learning, DP methods
solve MDPs exactly when the complete model (transition probabilities and rewards) is known.
The key insight of DP for MDPs is that optimal policies have the property that whatever the initial state and initial decision are,
the remaining decisions must constitute an optimal policy with regard to the state resulting from the first decision.
This is known as Bellman's principle of optimality, which leads to the recursive Bellman equations.
Value Iteration (VI) and Policy Iteration (PI) are the two classical dynamic programming methods for solving MDPs.
These algorithms are fundamental to model-based reinforcement learning, where an agent first learns the MDP model
from experience and then applies these DP methods to compute optimal policies.
Value Iteration (VI)
Let the initial estimate be \(V_0\). We update it as follows:
\[
V_{k+1}(s) = \max_a \left[ R(s, a) + \gamma \sum_{s'} p_T(s' \mid s, a) V_k (s') \right].
\]
This is called Bellman backup. Each iteration reduces the maximum value function error by a
constant factor:
\[
\max_s | V_{k+1}(s) - V_*(s) | \leq \gamma \max_s | V_k (s) - V_* (s) |.
\]
So, \(V_k\) converges to \(V_*\) as \(k \to \infty\).
Value iteration computes \(V_*(s)\) and \(\pi_*(s)\) for every state \(s\), averaging over all possible
next states \(s'\) at each iteration. (Note: if we only need the value and policy for certain starting states,
other methods, such as real-time dynamic programming, can be used.)
VALUE_ITERATION
Input: \(M = \left\langle \mathcal{S}, \mathcal{A}, p_T(s' | s, a), R(s, a), \gamma \right\rangle\)
Output: \(V_*\), \(\pi_*\)
begin
Initialize \(V(s)\) arbitrarily for all \(s \in \mathcal{S}\) (typically \(V(s) = 0\))
repeat:
\(\Delta \leftarrow 0\)
for each \(s \in \mathcal{S}\):
\(V^{\text{old}}(s) \leftarrow V(s)\)
\(V(s) \leftarrow \max_a \left[R(s,a) + \gamma \sum_{s'} p_T(s'|s,a) V(s')\right]\)
\(\Delta \leftarrow \max(\Delta, |V^{\text{old}}(s) - V(s)|)\)
until \(\Delta < \theta\) (small threshold)
// Extract optimal policy
for each \(s \in \mathcal{S}\):
\(\pi_*(s) \leftarrow \arg\max_a \left[R(s,a) + \gamma \sum_{s'} p_T(s'|s,a) V(s')\right]\)
end
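As a complement to the pseudocode, here is a compact NumPy sketch of the same loop; the model is assumed to be given as arrays P of shape [S, A, S] for \(p_T\) and R of shape [S, A] for the expected reward:
import numpy as np

def value_iteration(P, R, gamma, theta=1e-8):
    """Returns (V_*, pi_*) for a tabular MDP with P[s, a, s'] = p_T(s'|s, a) and R[s, a]."""
    n_states, n_actions = R.shape
    V = np.zeros(n_states)
    while True:
        Q = R + gamma * (P @ V)          # Q[s, a] = R(s, a) + gamma * sum_s' P[s, a, s'] V(s')
        V_new = Q.max(axis=1)            # Bellman backup
        if np.max(np.abs(V_new - V)) < theta:
            break
        V = V_new
    return V_new, Q.argmax(axis=1)       # greedy policy extraction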
Policy Iteration (PI)
Let \(\mathbf{v}(s) = V_{\pi}(s)\) be the value function encoded as a vector indexed by states \(s\). Also,
the reward vector is represented by
\[
\mathbf{r}(s) = \sum_a \pi(a \mid s) R(s, a)
\]
and the state transition matrix is written as
\[
\mathbf{T}(s' \mid s) = \sum_a \pi(a \mid s) p_T(s' \mid s, a).
\]
Then Bellman's equation for policy evaluation can be represented as a linear system in \(|\mathcal{S}|\) unknowns:
\[
\mathbf{v} = \mathbf{r} + \gamma \mathbf{T}\mathbf{v}
\]
In principle, we can solve this directly as \(\mathbf{v} = (\mathbf{I} - \gamma \mathbf{T})^{-1} \mathbf{r}\); in practice,
we can instead compute \(\mathbf{v}_{t+1} = \mathbf{r} + \gamma \mathbf{T} \mathbf{v}_t\) iteratively. This process is called
policy evaluation. Once we have evaluated \(V_{\pi}\) for the policy \(\pi\), we need to find a better
policy \(\pi'\).
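A short sketch of this exact policy-evaluation step as a linear solve; the tabular arrays P, R, and the stochastic policy matrix pi are assumed inputs:
import numpy as np

def evaluate_policy(P, R, pi, gamma):
    # Solve v = r + gamma * T v for a fixed stochastic policy pi[s, a].
    n_states = P.shape[0]
    r = (pi * R).sum(axis=1)                    # r(s) = sum_a pi(a|s) R(s, a)
    T = np.einsum('sa,sap->sp', pi, P)          # T(s'|s) = sum_a pi(a|s) p_T(s'|s, a)
    return np.linalg.solve(np.eye(n_states) - gamma * T, r)
The direct solve costs roughly cubic time in \(|\mathcal{S}|\), which is why the iterative update above is preferred for large state spaces.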
Now we move on to the policy improvement step:
\[
\pi'(s) = \arg \max_a \left[ R(s, a) + \gamma \, \mathbb{E}_{p_T(s' \mid s, a)}[V_{\pi}(s')] \right]
\]
This guarantees \(V_{\pi'} \geq V_{\pi}\) because:
\[
\begin{align*}
\mathbf{v} = \mathbf{r} + \gamma \mathbf{T}\mathbf{v} &\leq \mathbf{r'} + \gamma \mathbf{T'}\mathbf{v} \\\\
&\leq \mathbf{r'} + \gamma \mathbf{T'}(\mathbf{r'} + \gamma \mathbf{T'}\mathbf{v})\\\\
&\leq \cdots \\\\
&= (\mathbf{I} + \gamma \mathbf{T'} + \gamma^2 \mathbf{T'}^2 + \cdots)\mathbf{r'} \\\\
&= (\mathbf{I} - \gamma \mathbf{T'})^{-1} \mathbf{r'} \\\\
&= \mathbf{v'}
\end{align*}
\]
POLICY_ITERATION
Input: \(M = \left\langle \mathcal{S}, \mathcal{A}, p_T(s' | s, a), R(s, a), \gamma \right\rangle\)
Output: \(\pi_*\)
begin
Initialize \(V_{\pi}(s)\) arbitrarily for all \(s \in \mathcal{S}\) (typically \(V_{\pi}(s) = 0\))
Initialize \(\pi(s)\) arbitrarily for all \(s \in \mathcal{S}\)
repeat:
// Policy Evaluation
repeat:
\(\Delta \leftarrow 0\)
for each \(s \in \mathcal{S}\):
\(V_{\pi}^{\text{old}}(s) \leftarrow V_{\pi}(s)\)
\(V_{\pi}(s) \leftarrow \sum_a \pi(a|s) \left[R(s,a) + \gamma \sum_{s'} p_T(s'|s,a) V_{\pi}(s')\right]\)
\(\Delta \leftarrow \max(\Delta, |V_{\pi}^{\text{old}}(s) - V_{\pi}(s)|)\)
until \(\Delta < \theta\) (small threshold)
// Policy Improvement
\(\text{policy-stable} \leftarrow \text{true}\)
for each \(s \in \mathcal{S}\):
\(\text{old-action} \leftarrow \pi(s)\)
\(\pi(s) \leftarrow \arg\max_a \left[R(s,a) + \gamma \sum_{s'} p_T(s'|s,a) V_{\pi}(s')\right]\)
if \(\text{old-action} \neq \pi(s)\) then \(\text{policy-stable} \leftarrow \text{false}\)
until \(\text{policy-stable} = \text{true}\)
end
In the policy iteration algorithm, starting from an initial policy, we alternate between the policy evaluation step and the
policy improvement step. The algorithm converges to an optimal policy within finitely many iterations, since there are at
most \(|\mathcal{A}|^{|\mathcal{S}|}\) deterministic policies and every update improves the policy.
Monte Carlo Control
Having established the theoretical foundations of MDPs and dynamic programming, we now turn to model-free methods that
learn directly from experience. We begin with Monte Carlo (MC) control, a fundamental value-based method
that estimates value functions by sampling complete episodes and using actual returns.
The Monte Carlo control method works as follows:
- The agent takes action \(a\) in state \(s\),
- samples the rest of the trajectory according to policy \(\pi\) until a terminal state is reached, and
- computes the actual return \(G_t\) (the sum of discounted rewards) from that state-action pair.
We use this approach within a generalized policy iteration framework to find an optimal policy.
The process alternates between policy evaluation (using MC sampling to estimate \(Q_\pi(s,a)\)) and
policy improvement. At iteration \(k\), the policy improvement step becomes:
\[
\pi_{k+1}(s) = \arg \max_{a} Q_k (s, a)
\]
where \(Q_k\) is estimated using Monte Carlo sampling of complete episodes.
Note that to achieve convergence to the optimal policy while maintaining adequate exploration,
we can use an \(\epsilon\)-greedy policy that balances exploitation of current
knowledge with exploration of potentially better actions.
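A minimal sketch of this Monte Carlo control loop with an \(\epsilon\)-greedy policy; the environment interface (env.reset() returning a state, env.step(a) returning a (state, reward, done) triple) is an assumed convention, not part of the method itself:
import numpy as np
from collections import defaultdict

def mc_control(env, n_episodes, n_actions, gamma, epsilon, rng=np.random.default_rng()):
    # On-policy (every-visit) Monte Carlo control with an epsilon-greedy policy.
    Q = defaultdict(lambda: np.zeros(n_actions))
    counts = defaultdict(lambda: np.zeros(n_actions))
    for _ in range(n_episodes):
        episode, s, done = [], env.reset(), False
        while not done:                          # sample one complete episode
            if rng.random() < epsilon:
                a = int(rng.integers(n_actions))
            else:
                a = int(np.argmax(Q[s]))
            s_next, r, done = env.step(a)
            episode.append((s, a, r))
            s = s_next
        G = 0.0
        for s, a, r in reversed(episode):        # backward pass over observed returns
            G = r + gamma * G
            counts[s][a] += 1
            Q[s][a] += (G - Q[s][a]) / counts[s][a]   # incremental mean of sampled returns
    return Q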
Monte Carlo methods provide unbiased estimates since they use actual returns. However, MC
methods are not efficient for value-based RL because they require unrolling complete trajectories whose returns
are sums of random rewards generated by stochastic state transitions, leading to high variance
in value estimates. Moreover, MC methods work well only for episodic tasks since they need complete trajectories for each update step to obtain reliable
estimates. This requirement makes them unsuitable for continuing tasks or online learning scenarios.
To address these limitations, we now turn to a more efficient approach that uses bootstrapping — updating
estimates based on other estimates rather than waiting for final outcomes. Note that this concept of bootstrapping
in reinforcement learning is distinct from statistical bootstrapping used in other areas of machine learning.
Temporal Difference Learning
Temporal Difference (TD) learning methods address the limitations of Monte Carlo methods through
bootstrapping — updating value estimates using single-step transitions plus current estimates
of successor states rather than waiting for complete returns. In other words, TD methods incrementally reduce the
Bellman error for sampled states by learning from individual transitions rather than complete trajectories.
Theoretical Foundation
The TD target \(r_t + \gamma V(s_{t+1})\) serves as a sample-based approximation of the Bellman
expectation equation. From our earlier derivation, we know that:
\[
V_{\pi}(s) = \mathbb{E}_{\pi} \left[ r + \gamma V_{\pi}(s') \mid s \right]
\]
The TD update uses the observed reward \(r_t\) and next state \(s_{t+1}\) to approximate this expectation,
then adjusts the current estimate toward this improved target. This establishes TD learning as a method
for solving the Bellman equations through successive approximation, without requiring knowledge of
transition probabilities.
Under appropriate conditions (bounded rewards, finite state space, and appropriate learning rate schedules),
tabular TD(0) converges to the true value function \(V_{\pi}(s)\) for the policy being followed.
TD(0): One-Step Temporal Difference
The simplest TD method, TD(0), updates value estimates after each single step.
Suppose that the agent learns the value function \(V_{\pi}\) for a fixed policy \(\pi\).
Given a transition \((s_t, a_t, r_t, s_{t+1})\) with \(a_t \sim \pi(\cdot \mid s_t)\), we update the estimate \(V(s_t)\) as follows:
\[
V(s_t) \leftarrow V(s_t) + \alpha \left[ r_t + \gamma V(s_{t+1}) - V(s_t) \right]
\]
where:
- \(\alpha \in (0,1]\) is the learning rate
- \(r_t + \gamma V(s_{t+1})\) is the TD target
- \(\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)\) is the TD error
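A minimal tabular TD(0) policy-evaluation sketch, under the same assumed environment interface as the Monte Carlo sketch above:
import numpy as np

def td0_evaluate(env, policy, n_states, n_episodes, alpha, gamma):
    # Tabular TD(0): move V(s) toward the TD target r + gamma * V(s') after every step.
    V = np.zeros(n_states)
    for _ in range(n_episodes):
        s, done = env.reset(), False
        while not done:
            a = policy(s)
            s_next, r, done = env.step(a)
            target = r + gamma * (0.0 if done else V[s_next])
            V[s] += alpha * (target - V[s])      # alpha * TD error
            s = s_next
    return V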
n-Step TD Methods
TD methods can be generalized to look ahead multiple steps. The n-step return is defined as:
\[
G_{t:t+n} = r_t + \gamma r_{t+1} + \cdots + \gamma^{n-1} r_{t+n-1} + \gamma^n V(s_{t+n})
\]
The corresponding n-step TD update becomes:
\[
V(s_t) \leftarrow V(s_t) + \alpha [G_{t:t+n} - V(s_t)]
\]
This provides a spectrum of methods: \(n=1\) gives TD(0), and as \(n\) increases toward the episode length,
the method approaches Monte Carlo behavior by using longer actual return sequences.
TD(\(\lambda\)) and \(\lambda\)-returns
Rather than using a fixed n-step lookahead, TD(\(\lambda\)) methods use a weighted average of all n-step returns.
The λ-return is defined as:
\[
G_t^{\lambda} = (1-\lambda) \sum_{n=1}^{\infty} \lambda^{n-1} G_{t:t+n}
\]
where \(\lambda \in [0,1]\) controls the weighting between different n-step returns.
The parameter \(\lambda\) creates a different type of interpolation than n-step methods:
- When \(\lambda = 0\): Only the 1-step return is used, reducing to TD(0)
- When \(\lambda = 1\): For episodic tasks, this gives Monte Carlo behavior through full credit assignment
- For \(0 < \lambda < 1\): Creates a weighted blend that emphasizes shorter-term returns but includes longer-term effects
Note that TD(λ) with \(\lambda = 1\) achieves Monte Carlo-equivalent behavior through a different mechanism than
n-step TD with large n. While n-step methods extend the lookahead horizon, TD(λ) uses eligibility traces to
distribute credit backward through time, allowing for online updates even in continuing tasks.
The \(\lambda\)-return provides a principled way to interpolate between the low variance but biased estimates of TD(0)
and the high variance but unbiased estimates of Monte Carlo methods. TD(0) is thus a special case
of the more general TD(\(\lambda\)) framework. Importantly, both n-step TD and TD(\(\lambda\)) can achieve Monte Carlo behavior,
but through different parameterizations: n-step methods by extending the lookahead horizon (large n),
and TD(λ) methods through the weighting parameter (\(\lambda = 1\)).
Function Approximation
More generally, for function approximation where \(V_\mathbf{w}(s)\) is parameterized by weights \(\mathbf{w}\),
the TD(0) update becomes:
\[
\mathbf{w} \leftarrow \mathbf{w} + \alpha [ r_t + \gamma V_{\mathbf{w}}(s_{t+1}) - V_{\mathbf{w}}(s_t)]\nabla_{\mathbf{w}}V_{\mathbf{w}}(s_t)
\]
Note that with function approximation, TD methods may not always converge, unlike in the tabular case.
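For instance, with a linear approximator \(V_{\mathbf{w}}(s) = \mathbf{w}^{\top} \boldsymbol{\phi}(s)\), the update specializes to the following sketch (the feature map phi is an assumption):
import numpy as np

def td0_linear_update(w, phi, s, r, s_next, done, alpha, gamma):
    # One semi-gradient TD(0) step for V_w(s) = w @ phi(s).
    v_s = w @ phi(s)
    v_next = 0.0 if done else w @ phi(s_next)
    td_error = r + gamma * v_next - v_s
    return w + alpha * td_error * phi(s)         # grad of w @ phi(s) w.r.t. w is phi(s)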
Key Advantages
TD methods offer several key advantages that make them central to modern reinforcement learning:
- Online learning: Updates occur after each time step, enabling continuous learning without waiting for episode completion.
- Computational efficiency: No need to store or process entire episode returns.
- Lower variance: TD(0) has lower variance due to bootstrapping, though this comes at the cost of introducing bias.
- Applicability to continuing tasks: Can learn in non-episodic environments where episodes never terminate.
TD Control Methods: Q-Learning & SARSA
Having established the theoretical foundations of temporal difference learning, we now turn to specific
TD control algorithms that learn optimal policies by estimating action-value functions.
While the previous section focused on TD methods for policy evaluation (learning \(V_{\pi}\)), these methods
extend TD learning to the control problem by learning Q-functions and deriving policies from them.
Both Q-learning and SARSA are temporal difference methods that bootstrap using single-step transitions,
but they differ fundamentally in their approach to policy learning: one learns the optimal policy
directly (off-policy), while the other learns the policy it follows (on-policy).
SARSA (On-Policy TD Control)
The agent follows a policy \(\pi\), which it uses both to select actions at every step and to learn the Q-function. Given a transition \((s, a, r, s')\), the TD update rule is:
\[
Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma Q(s', a') - Q(s, a) \right]
\]
where \(a' \sim \pi(s')\) is the action that the agent will take in state \(s'\). This makes SARSA on-policy.
Note: SARSA is named after the augmented transition \((s, a, r, s', a')\).
Q-Learning (Off-Policy TD Control)
Instead of the sampled next action \(a' \sim \pi(s')\) used in SARSA, we use the greedy action in \(s'\):
\[
a' = \arg \max_b Q(s', b),
\]
making it off-policy. Thus the TD update rule for a transition \((s, a, r, s')\) is given by:
\[
Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{b} Q(s', b) - Q(s, a) \right].
\]
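The two update rules differ only in their targets; here is a minimal tabular sketch of one update step for each, assuming Q is a NumPy array and, for SARSA, that a_next has been sampled from the behavior policy:
import numpy as np

def sarsa_update(Q, s, a, r, s_next, a_next, alpha, gamma):
    # On-policy target: uses the action a_next actually sampled from the current policy.
    target = r + gamma * Q[s_next, a_next]
    Q[s, a] += alpha * (target - Q[s, a])

def q_learning_update(Q, s, a, r, s_next, alpha, gamma):
    # Off-policy target: uses the greedy action in s', regardless of what will actually be taken.
    target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])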
Key Differences
- Convergence: Q-learning converges to \(Q_*\) (optimal action-value function) regardless of policy followed; SARSA converges to \(Q_{\pi}\) for the specific policy being followed
- Policy dependence: SARSA accounts for the actual exploration policy in its updates, while Q-learning assumes optimal action selection regardless of the behavioral policy
- Behavioral differences: In some environments, SARSA may learn different policies than Q-learning because it considers the exploration policy's actual behavior, while Q-learning optimizes assuming greedy action selection
Both algorithms require exploration to discover good actions, leading us to the fundamental challenge
of balancing exploration and exploitation.
Policy Gradient Methods
Unlike value-based methods that learn value functions and derive policies implicitly,
policy-based methods directly parameterize and optimize policies without necessarily
learning value functions. These methods address fundamental limitations of value-based approaches,
particularly in continuous action spaces and when stochastic policies are desired.
Our objective is the expected return of a policy:
\[
\begin{align*}
J(\pi) &= \mathbb{E}_{p_0(s_0)}[V_{\pi}(s_0)] \\\\
&= \mathbb{E}_{p_0(s_0)\pi(a_0 \mid s_0)}[Q_{\pi}(s_0, a_0)].
\end{align*}
\]
Let \(\pi_{\theta}\) be parameterized by \(\mathbf{\theta}\). We compute the gradient of the objective with respect to \(\mathbf{\theta}\).
\[
\begin{align*}
\nabla_{\mathbf{\theta}} J(\pi_{\theta})
&= \mathbb{E}_{p_0(s_0)} \left[\nabla_{\mathbf{\theta}} \left(\sum_{a_0}\pi_{\mathbf{\theta}}(a_0 \mid s_0)Q_{\pi_{\theta}}(s_0, a_0) \right) \right] \\\\
&= \mathbb{E}_{p_0(s_0)} \left[\sum_{a_0} \nabla_{\mathbf{\theta}} \pi_{\mathbf{\theta}}(a_0 \mid s_0)Q_{\pi_{\theta}}(s_0, a_0) \right]
+ \mathbb{E}_{p_0(s_0)\pi_{\mathbf{\theta}}(a_0 \mid s_0)} [\nabla_{\mathbf{\theta}}Q_{\pi_{\mathbf{\theta}}}(s_0, a_0)].
\end{align*}
\]
Here,
\[
\begin{align*}
\nabla_{\mathbf{\theta}}Q_{\pi_{\mathbf{\theta}}}(s_0, a_0)
&= \nabla_{\mathbf{\theta}}\left[R(s_0, a_0) + \gamma \mathbb{E}_{p_T(s_1 \mid s_0, a_0)} [V_{\pi_{\mathbf{\theta}}}(s_1)]\right] \\\\
&= \gamma \nabla_{\mathbf{\theta}} \mathbb{E}_{p_T(s_1 \mid s_0, a_0)} [V_{\pi_{\mathbf{\theta}}}(s_1)].
\end{align*}
\]
Repeating the same steps, we obtain the policy gradient theorem:
\[
\begin{align*}
\nabla_{\theta}J(\pi_{\theta})
&= \sum_{t = 0}^{\infty} \gamma^t \mathbb{E}_{p_t(s)} \left[\sum_a \nabla_{\theta}\pi_{\theta}(a \mid s) Q_{\pi_{\theta}}(s, a)\right] \\\\
&= \frac{1}{1 -\gamma} \mathbb{E}_{p_{\pi_{\theta}}^{\infty}(s)} \left[ \sum_a \nabla_{\theta}\pi_{\theta}(a \mid s) Q_{\pi_{\theta}}(s, a) \right] \\\\
&= \frac{1}{1 -\gamma} \mathbb{E}_{p_{\pi_{\theta}}^{\infty}(s) \pi_{\theta}(a \mid s)} \left[ \nabla_{\theta} \log \pi_{\theta}(a \mid s) Q_{\pi_{\theta}}(s, a) \right]
\end{align*}
\]
where \(p_t(s)\) is the probability of visiting state \(s\) at time \(t\) if the agent starts from \(s_0 \sim p_0\) and follows \(\pi_{\theta}\), and
\[
p_{\pi_{\theta}}^{\infty}(s) = (1 - \gamma) \sum_{t = 0}^{\infty} \gamma^t p_t(s)
\]
is the normalized discounted state visitation distribution.
To reduce high variance, we can introduce a baseline:
\[
b(s) = V_{\pi_{\theta}}(s)
\]
and then the policy gradient theorem becomes:
\[
\nabla_{\theta}J(\pi_{\theta})
= \frac{1}{1 -\gamma} \mathbb{E}_{p_{\pi_{\theta}}^{\infty}(s) \pi_{\theta}(a \mid s)} \left[ \nabla_{\theta} \log \pi_{\theta}(a \mid s) (Q_{\pi_{\theta}}(s, a) - b(s)) \right].
\]
REINFORCE is the fundamental policy gradient algorithm that uses Monte Carlo estimation:
\[
\begin{align*}
\nabla_\theta J(\pi_\theta)
&= \frac{1}{1 -\gamma} \mathbb{E}_{p_{\pi_{\theta}}^{\infty}(s) \pi_{\theta}(a \mid s)} \left[ \nabla_{\theta} \log \pi_{\theta}(a \mid s) Q_{\pi_{\theta}}(s, a) \right]\\\\
&\approx \sum_{t=0}^{T-1} \gamma^t G_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)
\end{align*}
\]
where \(G_t\) is the return from time \(t\):
\[
G_t = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \cdots + \gamma^{T-t-1} r_{T-1}.
\]
Here, using a baseline in the gradient estimate, we obtain the REINFORCE update rule:
\[
\mathbf{\theta} \leftarrow \mathbf{\theta} + \alpha \sum_{t = 0}^{T -1} \gamma^t (G_t - b(s_t)) \nabla_\theta \log \pi_\theta (a_t \mid s_t).
\]
REINFORCE
begin
Set initial policy parameters \(\mathbf{\theta}\), and baseline parameters \(\mathbf{w}\)
repeat:
Sample an episode \(\tau = (s_0, a_0, r_0, s_1, \cdots, s_T)\) using the policy \(\pi_\theta\)
Compute \(G_t\) for all \(t \in \{0, 1, \cdots, T-1\}\)
for \(t = 0, 1, \cdots, T-1\) do
\(\delta = G_t - V_\mathbf{w}(s_t)\)
\(\mathbf{w} \leftarrow \mathbf{w} + \alpha_{\mathbf{w}} \delta \nabla_{\mathbf{w}}V_{\mathbf{w}}(s_t)\)
\(\mathbf{\theta} \leftarrow \mathbf{\theta} + \alpha_{\mathbf{\theta}} \gamma^t \delta \nabla_{\mathbf{\theta}} \log \pi_{\mathbf{\theta}}(a_t \mid s_t) \)
until converged
end
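A sketch of this REINFORCE-with-baseline loop for a small discrete problem, using a tabular softmax policy with logits theta[s, a] and a tabular baseline w[s]; this parameterization and the environment interface are illustrative assumptions:
import numpy as np

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def reinforce_episode(env, theta, w, alpha_theta, alpha_w, gamma, rng):
    # One REINFORCE update with baseline V_w(s) = w[s]; theta[s, a] are policy logits.
    states, actions, rewards = [], [], []
    s, done = env.reset(), False
    while not done:                                   # sample a complete episode with pi_theta
        probs = softmax(theta[s])
        a = int(rng.choice(len(probs), p=probs))
        s_next, r, done = env.step(a)
        states.append(s)
        actions.append(a)
        rewards.append(r)
        s = s_next
    G = 0.0
    for t in reversed(range(len(rewards))):           # backward pass over the episode
        G = rewards[t] + gamma * G
        s, a = states[t], actions[t]
        delta = G - w[s]                              # return minus baseline
        w[s] += alpha_w * delta                       # baseline update
        grad_log_pi = -softmax(theta[s])              # d log pi(a|s) / d theta[s,:] = onehot(a) - probs
        grad_log_pi[a] += 1.0
        theta[s] += alpha_theta * (gamma ** t) * delta * grad_log_pi
    return theta, w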
Actor-Critic Methods
Actor-critic methods are hybrid algorithms that combine both policy-based and value-based approaches
to address the high variance problem of pure policy gradient methods like REINFORCE. These methods maintain
two separate function approximators that work together:
- Actor \(\pi_{\mathbf{\theta}}(a \mid s)\): A parameterized policy that selects actions, where \(\mathbf{\theta}\) are the policy parameters.
- Critic \(V_{\mathbf{w}}(s)\): A parameterized state-value function that evaluates states, where \(\mathbf{w}\) are the value function parameters.
The actor updates the policy parameters \(\mathbf{\theta}\) using gradients weighted by the advantage estimates
provided by the critic, while the critic updates the value function parameters \(\mathbf{w}\) to more accurately
estimate expected returns through temporal difference learning.
Consider the use of the one-step TD(0) method to estimate the return in the episodic case.
If we use \(V_w(s_t)\) as a baseline, the REINFORCE update becomes:
\[
\begin{align*}
\mathbf{\theta} \leftarrow \mathbf{\theta} + \alpha \sum_{t=0}^{T-1} \gamma^t (G_{t:t+1} - V_\mathbf{w}(s_t))\nabla_{\theta} \log \pi_{\theta}(a_t \mid s_t)
\end{align*}
\]
where
\[
G_{t:t+1} = r_t + \gamma V_w (s_{t+1}).
\]
Here, \(r_t + \gamma V_w (s_{t+1}) - V_w(s_t)\) is a single sample approximation to the advantage function:
\[
A(s_t, a_t) = Q(s_t, a_t) - V(s_t).
\]
This method is called the advantage actor-critic (or A2C).
A2C (Advantage Actor-Critic)
begin
Set initial policy parameters \(\mathbf{\theta}\), and baseline parameters \(\mathbf{w}\)
repeat:
Sample starting state \(s_0\) of a new episode
for \(t = 0, 1, \cdots\) do
Sample action \(a_t \sim \pi_{\theta}(\cdot \mid s_t)\)
Observe next state \(s_{t+1}\) and reward \(r_t\)
\(\delta = r_t + \gamma V_\mathbf{w}(s_{t+1}) - V_\mathbf{w}(s_t)\)
\(\mathbf{w} \leftarrow \mathbf{w} + \alpha_{\mathbf{w}} \delta \nabla_{\mathbf{w}}V_{\mathbf{w}}(s_t)\)
\(\mathbf{\theta} \leftarrow \mathbf{\theta} + \alpha_{\mathbf{\theta}} \gamma^t \delta \nabla_{\mathbf{\theta}} \log \pi_{\mathbf{\theta}}(a_t \mid s_t) \)
until converged
end
Note that the \(n\)-step advantage estimate can be expressed as:
\[
A_{\pi_\theta}^{(n)}(s_t, a_t) = G_{t:t+n} - V_{\mathbf{w}}(s_t)
\]
where
\[
G_{t:t+n} = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \cdots + \gamma^{n-1} r_{t+n-1} + \gamma^n V_{\mathbf{w}}(s_{t+n}).
\]
We can control the bias-variance tradeoff by adjusting \(n\).
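A sketch of computing these n-step advantage estimates from a stored rollout; the arrays rewards and values, and the bootstrap value value_T estimating \(V_{\mathbf{w}}(s_T)\), are assumed inputs:
import numpy as np

def n_step_advantages(rewards, values, value_T, gamma, n):
    # A^(n)(s_t, a_t) = G_{t:t+n} - V_w(s_t), truncating the lookahead at the rollout end.
    T = len(rewards)
    v = np.append(values, value_T)                   # v[t] approximates V_w(s_t), including s_T
    advantages = np.zeros(T)
    for t in range(T):
        end = min(t + n, T)
        G = sum(gamma ** (k - t) * rewards[k] for k in range(t, end))
        G += gamma ** (end - t) * v[end]             # bootstrap with the value at the cutoff state
        advantages[t] = G - values[t]
    return advantages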
Actor-critic methods offer several key advantages:
- Reduced variance: Lower variance than pure policy gradient methods like REINFORCE through value function baselines
- Sample efficiency: More efficient learning through bootstrapping compared to Monte Carlo methods
- Broad applicability: Can handle both discrete and continuous action spaces with online learning capability
Other key actor-critic algorithms include:
- A3C (Asynchronous Advantage Actor-Critic): Uses multiple parallel agents
- DDPG (Deep Deterministic Policy Gradient): For continuous control
- PPO (Proximal Policy Optimization): Clips policy updates for stability