Intro to Reinforcement Learning


Introduction

Reinforcement Learning (RL) is a machine learning framework in which an agent interacts with an environment to learn a behavior that maximizes a scalar reward signal over time. Formally, the environment is typically modeled as a Markov Decision Process (MDP), defined by a tuple \[ (\mathcal{S}, \mathcal{A}, \mathcal{T}, r, \gamma) \] where \(\mathcal{S}\) is the state space, \(\mathcal{A}\) is the action space, \(\mathcal{T}\) is the transition function, \(r\) is the reward function, and \(\gamma\) is the discount factor.

At each time step, the agent observes a state \(s_t \in \mathcal{S}\), selects an action \(a_t \in \mathcal{A}\), receives a reward \(r_t\), and transitions to a new state \(s_{t+1}\) according to the transition dynamics \(\mathcal{T}(s_{t+1} \mid s_t, a_t)\). The goal of the agent is to learn a policy \(\pi(a \mid s)\) that maximizes the expected cumulative reward: \[ \mathbb{E}_\pi \left[ \sum_{t=0}^\infty \gamma^t r_t \right]. \]
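To make the interaction loop concrete, here is a minimal sketch in Python. The `env` object and `policy` function are hypothetical placeholders (in the style of common RL simulators), not part of this article: `env.reset()` is assumed to return an initial state, `env.step(action)` to return the next state, a reward, and a termination flag, and `policy(state)` to sample an action from \(\pi(a \mid s)\).

```python
def run_episode(env, policy, gamma=0.99, max_steps=1000):
    """Roll out one episode and accumulate the discounted return."""
    state = env.reset()                              # s_0 sampled by the environment
    discounted_return = 0.0
    for t in range(max_steps):
        action = policy(state)                       # a_t ~ pi(a | s_t)
        next_state, reward, done = env.step(action)  # s_{t+1}, r_t from the environment
        discounted_return += (gamma ** t) * reward   # add gamma^t * r_t
        state = next_state
        if done:
            break
    return discounted_return
```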

Reinforcement learning differs from supervised learning, where models are trained on ground-truth input-output pairs, and from unsupervised learning, which aims to discover hidden structure or representations in unlabeled data. In RL, learning is driven by a scalar reward signal that provides evaluative feedback based on the agent's interaction with an environment. Unlike supervised learning, there are no correct labels for each input, and unlike unsupervised learning, the objective is not to model data structure but to learn a policy that maximizes expected cumulative reward over time.

RL has been successfully applied in domains such as robotics, recommendation systems, and game-playing. More recently, RL has also become integral to the training of modern large language models (LLMs). In particular, a method called Reinforcement Learning from Human Feedback (RLHF) is used to fine-tune LLMs to follow human preferences, ensuring outputs are more helpful, safe, and aligned with user intent.

Model-Based and Model-Free Reinforcement Learning

Reinforcement Learning (RL) algorithms can be broadly categorized into two classes based on whether they learn an explicit model of the environment's dynamics: model-based and model-free methods.

Model-based RL aims to learn an explicit model of the environment from data. This includes learning the transition dynamics \(\mathcal{T}(s' \mid s, a)\), which describe how the environment evolves, and the reward function \(r(s, a)\), which specifies the immediate feedback for each state-action pair.

Once the model is learned from trajectories, classical planning algorithms such as value iteration or policy iteration can be applied to compute an optimal policy.
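As a rough illustration of the model-based recipe, the sketch below estimates a tabular transition model and expected reward from logged transitions by counting. It assumes small, discrete state and action spaces indexed by integers; the resulting arrays could then be passed to a planner such as the value iteration sketch shown later in the Return and Value Functions section.

```python
import numpy as np

def estimate_model(transitions, n_states, n_actions):
    """Estimate a tabular model from (s, a, r, s_next) tuples by counting.

    Returns:
        P: array of shape (S, A, S) with empirical transition probabilities.
        R: array of shape (S, A) with empirical expected immediate rewards.
    """
    counts = np.zeros((n_states, n_actions, n_states))
    reward_sums = np.zeros((n_states, n_actions))
    visits = np.zeros((n_states, n_actions))

    for s, a, r, s_next in transitions:
        counts[s, a, s_next] += 1
        reward_sums[s, a] += r
        visits[s, a] += 1

    visits = np.maximum(visits, 1)      # avoid division by zero for unvisited pairs
    P = counts / visits[:, :, None]     # empirical p_T(s' | s, a)
    R = reward_sums / visits            # empirical R(s, a)
    return P, R
```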

In contrast, model-free RL methods bypass explicit modeling of the environment. Instead, they directly learn value functions (e.g., Q-learning) or policies (e.g., REINFORCE, PPO) based on sampled experience. These methods are generally simpler to implement but may require more interaction data.
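For contrast, here is a minimal sketch of tabular Q-learning, a model-free method: it never estimates the transition or reward model and instead updates a Q-table directly from sampled transitions. The learning rate `alpha` and the epsilon-greedy exploration scheme are illustrative choices, not prescribed by this article.

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """One tabular Q-learning update from a single transition (s, a, r, s_next)."""
    td_target = r + gamma * np.max(Q[s_next])   # bootstrap with max_a' Q(s', a')
    Q[s, a] += alpha * (td_target - Q[s, a])    # move Q(s, a) toward the target
    return Q

def epsilon_greedy(Q, s, epsilon=0.1):
    """Explore with probability epsilon, otherwise act greedily w.r.t. Q."""
    if np.random.rand() < epsilon:
        return np.random.randint(Q.shape[1])
    return int(np.argmax(Q[s]))
```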

In both paradigms, the environment is formally assumed to follow a Markov Decision Process (MDP), which we now define in the following section.

Markov Decision Process (MDP)

In this section, we present a probabilistic formulation of Markov Decision Processes (MDPs), where both transitions and rewards are modeled as stochastic variables. This perspective is particularly useful for analyzing trajectory distributions and gradient-based learning methods.

An agent sequentially interacts with an initially unknown environment to obtain a trajectory or multiple trajectories. A trajectory of length \(T\) is defined as: \[ \boldsymbol{\tau} = (s_0, a_0, r_0, s_1, a_1, r_1, s_2, \cdots, s_T), \] where \(s_t\) is a state, \(a_t\) is an action, and \(r_t\) is a reward.

The objective is to optimize the agent's action-selection policy so that the expected discounted cumulative reward is maximized: \[ G_0 = \sum_{t = 0}^{T -1} \gamma^t r_t, \] where \(\gamma \in [0, 1]\) is the discount factor. We assume the environment follows a Markov Decision Process (MDP), where the trajectory distribution can be factored into single-step transition and reward models. The process of estimating an optimal policy from trajectories is referred to as learning.

We define the MDP as the tuple: \[ \left\langle \mathcal{S}, \mathcal{A}, p_T, p_R, p_0 \right\rangle, \] where \(\mathcal{S}\) is the state space, \(\mathcal{A}\) is the action space, \(p_T(s_{t+1} \mid s_t, a_t)\) is the state-transition model, \(p_R(r_t \mid s_t, a_t, s_{t+1})\) is the reward model, and \(p_0(s_0)\) is the initial-state distribution.

At time \(t = 0\), the initial state is sampled as \(s_0 \sim p_0\). At each step \(t \geq 0\), the agent observes state \(s_t \in \mathcal{S}\) and selects action \(a_t \sim \pi(a_t \mid s_t)\); the environment then transitions to the next state \(s_{t+1} \sim p_T(s_{t+1} \mid s_t, a_t)\) and emits reward \(r_t \sim p_R(r \mid s_t, a_t, s_{t+1})\). The agent's decision-making is governed by a stochastic policy \(\pi(a \mid s)\).

This interaction at each step is called a transition, represented as the tuple: \[ (s_t, a_t, r_t, s_{t+1}), \] where \(s_t\) is the current state, \(a_t\) is the action taken, \(r_t\) is the reward received, and \(s_{t+1}\) is the resulting next state.

Under policy \(\pi\), the joint distribution of a trajectory \(\boldsymbol{\tau}\) of length \(T\) is given by: \[ p(\boldsymbol{\tau}) = p_0(s_0) \prod_{t = 0}^{T -1} \pi(a_t \mid s_t) \, p_T(s_{t+1} \mid s_t, a_t) \, p_R(r_t \mid s_t, a_t, s_{t+1}). \]
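This factorization can be evaluated numerically. The sketch below computes the log-probability of a trajectory from hypothetical tabular arrays `p0`, `pi`, and `P`; the reward factor \(p_R(r_t \mid s_t, a_t, s_{t+1})\) is omitted for brevity, which amounts to treating rewards as deterministic given the transition.

```python
import numpy as np

def trajectory_log_prob(states, actions, p0, pi, P):
    """Log-probability of a trajectory under the factorization above.

    Args:
        states: [s_0, ..., s_T] as integer indices (length T + 1).
        actions: [a_0, ..., a_{T-1}] as integer indices (length T).
        p0: array (S,), initial state distribution p_0(s).
        pi: array (S, A), policy pi(a | s).
        P: array (S, A, S), transition model p_T(s' | s, a).
    """
    logp = np.log(p0[states[0]])
    for t, a in enumerate(actions):
        s, s_next = states[t], states[t + 1]
        logp += np.log(pi[s, a]) + np.log(P[s, a, s_next])
    return logp
```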

We define the expected reward function from the reward model \(p_R\) as the marginal average immediate reward for taking action \(a\) in state \(s\), integrating over possible next states: \[ R(s, a) = \mathbb{E}_{p_T(s' \mid s, a)} \left[ \mathbb{E}_{p_R(r \mid s, a, s')}[r] \right]. \]
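With tabular models, this marginalization is a single weighted sum. A minimal sketch, assuming `P[s, a, s_next]` stores \(p_T(s' \mid s, a)\) and `r_mean[s, a, s_next]` stores \(\mathbb{E}_{p_R}[r \mid s, a, s']\):

```python
import numpy as np

def expected_reward(P, r_mean):
    """R(s, a) = sum_{s'} p_T(s' | s, a) * E[r | s, a, s'].

    Args:
        P: array (S, A, S) with p_T(s' | s, a).
        r_mean: array (S, A, S) with the mean of p_R(r | s, a, s').
    Returns:
        Array (S, A) of expected immediate rewards.
    """
    return np.sum(P * r_mean, axis=-1)
```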

While classical RL literature often defines the reward function deterministically as \(r(s, a)\), this probabilistic formulation allows us to account for uncertainty in transitions and rewards. It is particularly useful for gradient-based RL methods, where trajectory likelihoods must be modeled explicitly.

Return and Value Functions

Let \(\boldsymbol{\tau}\) be a trajectory of length \(T\), where \(T\) may be infinite (\(T = \infty\)) for continuing tasks. The return at time \(t\) is defined as the total accumulated reward from that point forward, discounted by a factor \(\gamma \in [0, 1]\): \[ \begin{align*} G_t &= r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \cdots + \gamma^{T - t - 1} r_{T - 1} \\\\ &= \sum_{k = 0}^{T - t - 1} \gamma^k r_{t + k} \\\\ &= \sum_{j = t}^{T - 1} \gamma^{j - t} r_j. \end{align*} \] The discount factor \(\gamma\) ensures that the return remains finite even for infinite-horizon problems, and it gives higher weight to short-term rewards, thereby encouraging the agent to achieve goals sooner.
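A convenient way to compute all returns of a finite trajectory is the backward recursion \(G_t = r_t + \gamma G_{t+1}\) with \(G_T = 0\), sketched below for a list of rewards \(r_0, \ldots, r_{T-1}\).

```python
def discounted_returns(rewards, gamma=0.99):
    """Compute G_t for t = 0, ..., T-1 via G_t = r_t + gamma * G_{t+1}, with G_T = 0."""
    returns = [0.0] * len(rewards)
    g = 0.0
    for t in reversed(range(len(rewards))):
        g = rewards[t] + gamma * g
        returns[t] = g
    return returns

# Example: rewards [1, 0, 2] with gamma = 0.5 give G_2 = 2.0, G_1 = 1.0, G_0 = 1.5.
```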

Given a stochastic policy \(\pi(a \mid s)\), the state-value function (or simply, value function) is defined as: \[ \begin{align*} V_{\pi}(s) &= \mathbb{E}_{\pi} \left[ G_0 \mid s_0 = s \right] \\\\ &= \mathbb{E}_{\pi} \left[ \sum_{t = 0}^{\infty} \gamma^t r_t \mid s_0 = s \right], \end{align*} \] where the expectation is over trajectories induced by the policy \(\pi\).

The action-value function (or Q-function) is defined as: \[ \begin{align*} Q_{\pi}(s, a) &= \mathbb{E}_{\pi} \left[ G_0 \mid s_0 = s, \, a_0 = a \right] \\\\ &= \mathbb{E}_{\pi} \left[ \sum_{t = 0}^{\infty} \gamma^t r_t \mid s_0 = s, \, a_0 = a \right]. \end{align*} \]

The advantage function is defined as: \[ A_{\pi}(s, a) = Q_{\pi}(s, a) - V_{\pi}(s), \] which quantifies how much better it is to take action \(a\) in state \(s\) and then follow policy \(\pi\), compared to just following \(\pi\) from the beginning.

A policy \(\pi_*\) is called an optimal policy if it yields the highest value for every state: \[ \forall s \in \mathcal{S}, \quad V_{\pi_*}(s) \geq V_{\pi}(s), \quad \forall \pi. \] Although multiple optimal policies may exist, their value functions are the same: \(V_*(s) = V_{\pi_*}(s)\) and \(Q_*(s, a) = Q_{\pi_*}(s, a)\). Moreover, every finite MDP admits at least one deterministic optimal policy.

Bellman's Optimality Equations: \[ \begin{align*} V_*(s) &= \max_a \left[ R(s, a) + \gamma \, \mathbb{E}_{p_T(s' \mid s, a)} \left[ V_*(s') \right] \right] \\\\ Q_*(s, a) &= R(s, a) + \gamma \, \mathbb{E}_{p_T(s' \mid s, a)} \left[ \max_{a'} Q_*(s', a') \right] \end{align*} \] The optimal value functions \(V_*\) and \(Q_*\) are the unique fixed points of these equations.

The optimal policy can then be derived by: \[ \begin{align*} \pi_*(s) &= \arg \max_a Q_*(s, a) \\\\ &= \arg \max_a \left[ R(s, a) + \gamma \, \mathbb{E}_{p_T(s' \mid s, a)} \left[ V_*(s') \right] \right]. \end{align*} \] Solving for \(V_*\), \(Q_*\), or \(\pi_*\) is known as policy optimization, while computing \(V_{\pi}\) or \(Q_{\pi}\) for a given policy \(\pi\) is called policy evaluation.
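The optimality equations suggest value iteration: repeatedly apply the Bellman optimality backup until the values stop changing, then extract the greedy policy. A minimal tabular sketch, assuming arrays `P` of shape (S, A, S) holding \(p_T\) and `R` of shape (S, A) holding \(R(s, a)\) (for instance, the estimates from the model-based sketch earlier):

```python
import numpy as np

def value_iteration(P, R, gamma=0.99, tol=1e-8):
    """Solve for V_* and a greedy policy via repeated Bellman optimality backups."""
    n_states, n_actions, _ = P.shape
    V = np.zeros(n_states)
    while True:
        # Q(s, a) = R(s, a) + gamma * sum_{s'} p_T(s' | s, a) * V(s')
        Q = R + gamma * np.einsum("sap,p->sa", P, V)
        V_new = Q.max(axis=1)                 # V(s) = max_a Q(s, a)
        if np.max(np.abs(V_new - V)) < tol:
            break
        V = V_new
    policy = Q.argmax(axis=1)                 # pi_*(s) = argmax_a Q_*(s, a)
    return V_new, Q, policy
```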

Bellman's equations for policy evaluation can be derived from the optimality equations by replacing each maximization over actions, \(\max_a [\,\cdot\,]\), with an expectation over the policy at the corresponding state, \(\mathbb{E}_{\pi(a \mid s)}[\,\cdot\,]\), wherever action selection occurs.

Bellman's Expectation Equations:

For a fixed policy \(\pi\), the state-value function satisfies:

\[ V_{\pi}(s) = \mathbb{E}_{\pi(a \mid s)} \left[ R(s, a) + \gamma \, \mathbb{E}_{p_T(s' \mid s, a)} \left[ V_{\pi}(s') \right] \right] \]

The action-value function satisfies:

\[ Q_{\pi}(s, a) = R(s, a) + \gamma \, \mathbb{E}_{p_T(s' \mid s, a)} \left[ \mathbb{E}_{\pi(a' \mid s')} \left[ Q_{\pi}(s', a') \right] \right] \]
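These expectation equations lead directly to iterative policy evaluation: start from an arbitrary value estimate and repeatedly apply the Bellman expectation backup for the fixed policy until convergence. A minimal sketch under the same tabular assumptions as the value iteration example, with a stochastic policy stored as an array `pi` of shape (S, A):

```python
import numpy as np

def policy_evaluation(P, R, pi, gamma=0.99, tol=1e-8):
    """Compute V_pi (and Q_pi) by iterating the Bellman expectation backup."""
    n_states = P.shape[0]
    V = np.zeros(n_states)
    while True:
        # Q_pi(s, a) = R(s, a) + gamma * sum_{s'} p_T(s' | s, a) * V(s')
        Q = R + gamma * np.einsum("sap,p->sa", P, V)
        V_new = np.sum(pi * Q, axis=1)        # V_pi(s) = sum_a pi(a | s) * Q_pi(s, a)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q
        V = V_new
```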