Reinforcement Learning

Markov Decision Process (MDP)

Markov Process (MP)

Given a sequence of states \(h_t = \{s_1, s_2, ..., s_t\}\), the Markov Process assumes that the next state \(s_{t+1}\) depends only on the current state \(s_t\), i.e.,

\[P(s_{t+1} | s_t) = P(s_{t+1} | h_t)\]

Markov Reward Process (MRP)

MP + Rewards

  • Tuple: \(\langle S, P, R, \gamma \rangle\)

    • \(S\) (State): A (finite) set of states (\(s \in S\)).
    • \(P\) (Transition Probability): \(P(s_{t+1}=s'|s_t=s)\).
    • \(R\) (Reward): Expectation of the immediate reward: \(R(s_t=s) = \mathbb{E}[r_t | s_t=s]\).
    • \(\gamma\) (Discount factor): \(\gamma \in [0, 1]\), which keeps the return finite in cyclic or infinite-horizon processes.
  • Return (\(G_t\)): Total discounted reward from time \(t\).

    \[G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + ... + \gamma^{T-t-1} R_T\]

Value Function \(V(s)\) and Bellman Equation

The expected return starting from state \(s\) is defined as the value function:

\[V(s) = \mathbb{E}[G_t | s_t = s] = \mathbb{E}\left[\sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \Big| s_t = s\right]\]

The MRP value function satisfies the Bellman Equation:

\[V(s) = \underbrace{R(s)}_{\text{immediate reward}} + \underbrace{\gamma \sum_{s' \in S} P(s'|s)V(s')}_{\text{Discounted sum of future reward}}\]

We can express \(V(s)\) in matrix form:

\[ \begin{bmatrix} V(s_1) \\ V(s_2) \\ \vdots \\ V(s_N) \end{bmatrix} = \begin{bmatrix} R(s_1) \\ R(s_2) \\ \vdots \\ R(s_N) \end{bmatrix} + \gamma \begin{bmatrix} P(s_1|s_1) & P(s_2|s_1) & \cdots & P(s_N|s_1) \\ P(s_1|s_2) & P(s_2|s_2) & \cdots & P(s_N|s_2) \\ \vdots & \vdots & \ddots & \vdots \\ P(s_1|s_N) & P(s_2|s_N) & \cdots & P(s_N|s_N) \end{bmatrix} \begin{bmatrix} V(s_1) \\ V(s_2) \\ \vdots \\ V(s_N) \end{bmatrix} \]

and obtain the analytical solution of Bellman Equation:

\[V = (I - \gamma P)^{-1} R,\]

where \(I\) is the identity matrix. This matrix inversion costs \(O(N^3)\) for \(N\) states, so the analytical solution is only practical for small MRPs.
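For small state spaces the closed-form solution can be computed directly; here is a minimal NumPy sketch on a made-up 3-state MRP (the numbers are illustrative, not from the notes):

```python
import numpy as np

# Made-up 3-state MRP: row i of P holds P(s' | s_i); state 3 is absorbing.
P = np.array([[0.5, 0.5, 0.0],
              [0.2, 0.3, 0.5],
              [0.0, 0.0, 1.0]])
R = np.array([1.0, 2.0, 0.0])
gamma = 0.9

# Analytical solution V = (I - gamma P)^{-1} R; np.linalg.solve avoids
# forming the inverse explicitly and is numerically preferable.
V = np.linalg.solve(np.eye(3) - gamma * P, R)
```

The result satisfies the Bellman equation \(V = R + \gamma P V\) by construction.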

Markov Decision Process (MDP)

Now we add a finite set of actions \(A\):

  • Tuple: \(\langle S, {\color{red}A}, P, R, \gamma \rangle\)
  • Policy (\(\pi\)): The agent's behavior function.

    \[\pi(a|s) = P(a_t = a | s_t = s)\]

Given a policy \(\pi\), the MDP reduces to an MRP \(\langle S, P^\pi, R^\pi, \gamma \rangle\) where:

\[ \begin{aligned} P^\pi(s'|s) &= \sum_{a \in A} \pi(a|s) P(s'|s,a)\\ R^\pi(s) &= \sum_{a \in A} \pi(a|s) R(s,a) \end{aligned} \]

Value Functions: \(V\) and \(Q\)

  • State-value function \(v^\pi(s)\): Expected return starting from \(s\), and following policy \(\pi\)
\[ v^\pi(s) = \mathbb{E}_\pi[G_t | s_t = s] \]
  • Action-value function \(q^\pi(s, a)\): Expected return starting from \(s\), taking action \(a\), and then following policy \(\pi\):
\[ q^\pi(s, a) = \mathbb{E}_\pi[G_t | s_t = s, a_t = a] \]

and we have the relationship:

\[ \begin{aligned} v^\pi(s) &= \sum_{a \in A} \pi(a|s) q^\pi(s, a)\\ q^\pi(s, a) &= R(s,a) + \gamma \sum_{s' \in S} P(s'|s,a) v^\pi(s') \end{aligned} \]

Thus, we have Bellman Expectation Equation:

\[ \begin{aligned} v^\pi(s) &= \sum_{a \in A} \pi(a|s) \left( R(s,a) + \gamma \sum_{s' \in S} P(s'|s,a) v^\pi(s') \right)\\ &\\ q^\pi(s, a) &= R(s,a) + \gamma \sum_{s' \in S} P(s'|s,a) \sum_{a' \in A} \pi(a'|s') q^\pi(s', a') \end{aligned} \]

Solving MDPs: Prediction and Control

Prediction (Policy Evaluation)

Goal: Find value function \(v_\pi\) given policy \(\pi\).

  • Method: Iterative Policy Evaluation.
  • Update Rule (Bellman Expectation Backup):

    \[v_{k+1}(s) \leftarrow \sum_a \pi(a|s) \left( R(s,a) + \gamma \sum_{s'} P(s'|s,a) v_k(s') \right)\]

    until convergence.
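The backup above can be sketched in a few lines of NumPy, assuming the policy-induced MRP dynamics \(P^\pi\) and \(R^\pi\) have already been computed (the 2-state example is made up):

```python
import numpy as np

def policy_evaluation(P_pi, R_pi, gamma, tol=1e-10):
    """Repeated Bellman expectation backups until the value table stops moving.

    P_pi[s, s'] and R_pi[s] are the policy-induced MRP dynamics.
    """
    V = np.zeros(len(R_pi))
    while True:
        V_new = R_pi + gamma * P_pi @ V        # Bellman expectation backup
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new

# Made-up 2-state chain under some fixed policy
P_pi = np.array([[0.9, 0.1],
                 [0.0, 1.0]])
R_pi = np.array([1.0, 0.0])
V = policy_evaluation(P_pi, R_pi, gamma=0.9)
```

The iterate converges to the same value as the analytical solution \((I - \gamma P^\pi)^{-1} R^\pi\).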

Control (Finding Optimal Policy)

Goal: Find \(\pi^*\) that maximizes rewards.

\[\pi^*(s) = \arg\max_\pi v_\pi(s)\]

Two primary DP algorithms:

  1. Policy Iteration:

    • Cycle: Evaluation (\(v = v_\pi\)) \(\to\) Improvement (\(\pi' = \text{greedy}(v)\)).

      1. Compute state-action value of a policy \(\pi\):

        \[q^{\pi_i}(s, a) = R(s,a) + \gamma \sum_{s'\in S} P(s'|s,a) v^{\pi_i}(s')\]
      2. Compute new policy \(\pi_{i+1}\) for all \(s\in S\):

        \[\pi_{i+1}(s) = \arg\max_a q^{\pi_i}(s, a)\]
    • Each cycle monotonically improves the policy, so convergence to the optimal policy \(\pi^*\) is guaranteed for finite MDPs.

  2. Value Iteration:

    • Find optimal value function iteratively with Bellman Optimality backup, and retrieve policy from optimal value function.

      1. Update q function:

        \[q_{k+1}(s, a) = R(s,a) + \gamma \sum_{s'} P(s'|s,a) v_k(s')\]
      2. Find best value function:

        \[v_{k+1}(s) = \max_a q_{k+1}(s, a)\]
    • Retrieve optimal policy:

      \[\pi^*(s) = \arg\max_a q^*(s, a)\]
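Policy iteration can be sketched in NumPy as follows; the tabular layout `P[a, s, s']`, `R[s, a]` and the toy 2-state MDP are assumptions of this example, not part of the notes:

```python
import numpy as np

def policy_iteration(P, R, gamma):
    """Policy iteration on a tabular MDP with P[a, s, s'] and R[s, a]."""
    n_actions, n_states, _ = P.shape
    pi = np.zeros(n_states, dtype=int)              # arbitrary initial policy
    while True:
        # Evaluation: solve the linear Bellman system exactly for v^pi
        P_pi = P[pi, np.arange(n_states)]           # P^pi[s, s']
        R_pi = R[np.arange(n_states), pi]           # R^pi[s]
        v = np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)
        # Improvement: act greedily with respect to q^pi(s, a)
        q = R + gamma * np.einsum('ast,t->sa', P, v)
        pi_new = q.argmax(axis=1)
        if np.array_equal(pi_new, pi):              # policy stable -> optimal
            return pi, v
        pi = pi_new

# Made-up 2-state MDP: in state 0, action 1 jumps to the absorbing state 1
# for reward +1; all other transitions pay 0.
P = np.array([[[1., 0.], [0., 1.]],    # action 0: stay put
              [[0., 1.], [0., 1.]]])   # action 1: go to state 1
R = np.array([[0., 1.],
              [0., 0.]])
pi_star, v_star = policy_iteration(P, R, gamma=0.9)
```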

Policy Iteration vs. Value Iteration

  • Policy iteration: iteration on policy evaluation & policy improvement until policy converges.
  • Value iteration: iteration on value function (find optimal value), then extract policy once.
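Value iteration fits the same sketch style; this NumPy example (array layout `P[a, s, s']`, `R[s, a]` and the toy MDP are assumptions of the example) applies the Bellman optimality backup until the values stop changing, then extracts the policy once:

```python
import numpy as np

def value_iteration(P, R, gamma, tol=1e-10):
    """Bellman optimality backups; returns optimal values and greedy policy."""
    n_actions, n_states, _ = P.shape
    V = np.zeros(n_states)
    while True:
        Q = R + gamma * np.einsum('ast,t->sa', P, V)   # q_{k+1}(s, a)
        V_new = Q.max(axis=1)                          # v_{k+1}(s) = max_a q
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=1)             # extract policy once
        V = V_new

# Made-up 2-state MDP: in state 0, action 1 jumps to the absorbing state 1
# for reward +1; all other transitions pay 0.
P = np.array([[[1., 0.], [0., 1.]],    # action 0: stay put
              [[0., 1.], [0., 1.]]])   # action 1: go to state 1
R = np.array([[0., 1.],
              [0., 0.]])
V_star, pi_star = value_iteration(P, R, gamma=0.9)
```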

Model-Free Prediction and Control

The MDP might be unknown in practice (\({\color{red}R}\) and \({\color{red}P}\)):

\[ \begin{aligned} v_i(s) &= \sum_a \pi(a|s) \left( {\color{red}R(s,a)} + \gamma \sum_{s' \in S} {\color{red}P(s'|s,a)} v_{i-1}(s') \right)\\ q_i(s, a) &= {\color{red}R(s,a)} + \gamma \sum_{s' \in S} {\color{red}P(s'|s,a)} \sum_{a' \in A} \pi(a'|s') q_{i-1}(s', a') \end{aligned} \]

This usually means the full tables of \(R(s,a)\) and \(P(s'|s,a)\) are unknown.

Model-Free Prediction (Policy Evaluation)

Goal: Estimate \(v_\pi(s)\) given episodes of experience under policy \(\pi\).

1. Monte Carlo (MC) Learning

MC methods learn directly from complete episodes of experience (episodic tasks only, i.e., episodes must terminate):

\[V(S_t) \leftarrow V(S_t) + \alpha (G_t - V(S_t))\]

where \(G_t\) is the actual total discounted return and \(\alpha\) is the step size (learning rate).

2. Temporal Difference (TD) Learning

TD methods learn from incomplete episodes by bootstrapping (updating a guess towards a guess), i.e., update \(V(S_t)\) towards the estimated return after one step (\(R_{t+1} + \gamma V(S_{t+1})\)).

  • TD(0) Update Rule:
\[V(S_t) \leftarrow V(S_t) + \alpha (\underbrace{R_{\color{red}t+1} + \gamma V(S_{\color{red}t+1})}_{\text{TD Target}} - V(S_t))\]
  • TD Error (\(\delta_t\)): \(\delta_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t)\)

MC vs. TD

  • Bias/Variance:
    • MC: Zero bias (unbiased estimate of \(v_\pi\)), High variance (depends on many random actions/transitions).
    • TD: Some bias (bootstraps on current estimates), Lower variance (depends on only one random action/transition).
  • Efficiency: TD usually converges faster and does not wait for the end of the episode.
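A minimal sketch of tabular TD(0) prediction on a made-up deterministic chain \(0 \to 1 \to 2 \to 3\) (terminal), reward \(+1\) per step and \(\gamma = 1\), so the true values are \(V = [3, 2, 1, 0]\):

```python
gamma, alpha = 1.0, 0.1
V = [0.0, 0.0, 0.0, 0.0]           # V[3] stays 0 (terminal state)

for _ in range(500):               # 500 episodes of the length-3 chain
    s = 0
    while s != 3:
        s_next, r = s + 1, 1.0     # deterministic transition and reward
        # TD(0) backup: nudge V(s) towards the TD target r + gamma * V(s')
        V[s] += alpha * (r + gamma * V[s_next] - V[s])
        s = s_next
```

An MC update would instead wait until the episode ends and move \(V(S_t)\) towards the full return \(G_t\); on this deterministic chain both converge to the same values, but TD updates within the episode.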

Model-Free Control

Goal: Find optimal policy \(\pi^*\) without a model. Without \(P(s'|s,a)\), we cannot use \(V(s)\) to look ahead and choose actions. We must learn Action-Value Function \(Q(s,a)\).

  • Exploration: To ensure all pairs \((s,a)\) are visited, we use \(\epsilon\)-greedy exploration:

    \[ \pi(a|s) = \begin{cases} \underbrace{1 - \epsilon}_{\text{greedy prob}} + \underbrace{\epsilon / |A|}_{\text{random prob}} & \text{if } a = \arg\max_{a'} Q(s,a') \\ \epsilon / |A| & \text{otherwise} \end{cases} \]

1. SARSA (On-Policy Control)

State-Action-Reward-State-Action. The agent learns the value of the policy being carried out (including the exploration steps).

  • Algorithm:

    1. At state \(S\), choose action \(A\) using \(\epsilon\)-greedy derived from \(Q\).
    2. Take action \(A\), observe \(R, S'\).
    3. Choose next action \(A'\) using \(\epsilon\)-greedy derived from \(Q\) (at \(S'\)).
    4. Update:

      \[Q(S, A) \leftarrow Q(S, A) + \alpha [R + \gamma Q(S', A') - Q(S, A)]\]
    5. \(S \leftarrow S', A \leftarrow A'\).
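The five steps above can be sketched in plain Python on a made-up 2-state chain (action 1 advances one state, and entering the terminal state 2 pays \(+1\); action 0 stays put for reward 0; \(\gamma = 0.9\)):

```python
import random

random.seed(0)

n_actions, gamma, alpha, eps = 2, 0.9, 0.1, 0.1
Q = [[0.0, 0.0], [0.0, 0.0]]                   # Q[s][a] for states 0 and 1

def eps_greedy(s):
    if random.random() < eps:
        return random.randrange(n_actions)                   # explore
    return max(range(n_actions), key=lambda a: Q[s][a])      # exploit

def step(s, a):
    if a == 1:
        return s + 1, (1.0 if s + 1 == 2 else 0.0)
    return s, 0.0

for _ in range(2000):
    s = 0
    a = eps_greedy(s)                          # choose A from S (eps-greedy)
    while s != 2:
        s2, r = step(s, a)                     # take A, observe R, S'
        a2 = eps_greedy(s2) if s2 != 2 else 0  # choose A' from S' (eps-greedy)
        target = r + (gamma * Q[s2][a2] if s2 != 2 else 0.0)
        Q[s][a] += alpha * (target - Q[s][a])  # backup on (S, A, R, S', A')
        s, a = s2, a2
```

Because the target uses the action actually taken next, the learned values reflect the \(\epsilon\)-greedy behavior, not the greedy policy.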

2. Q-Learning (Off-Policy Control)

The agent learns the value of the optimal policy while following a behavior policy (e.g., \(\epsilon\)-greedy).

  • Algorithm:

    1. At state \(S\), choose action \(A\) using \(\epsilon\)-greedy derived from \(Q\).
    2. Take action \(A\), observe \(R, S'\).
    3. Update (Assume the next action is greedy):

      \[Q(S, A) \leftarrow Q(S, A) + \alpha [R + \gamma \max_{a'} Q(S', a') - Q(S, A)]\]
    4. \(S \leftarrow S'\).
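A sketch of these steps on a made-up 2-state chain (action 1 advances one state, entering the terminal state 2 pays \(+1\); action 0 stays put for reward 0; \(\gamma = 0.9\), so \(Q^*(0,1) = 0.9\) and \(Q^*(1,1) = 1\)):

```python
import random

random.seed(0)

n_actions, gamma, alpha, eps = 2, 0.9, 0.1, 0.1
Q = [[0.0, 0.0], [0.0, 0.0]]

def eps_greedy(s):
    if random.random() < eps:
        return random.randrange(n_actions)
    return max(range(n_actions), key=lambda a: Q[s][a])

def step(s, a):
    if a == 1:
        return s + 1, (1.0 if s + 1 == 2 else 0.0)
    return s, 0.0

for _ in range(2000):
    s = 0
    while s != 2:
        a = eps_greedy(s)                       # behavior policy: eps-greedy
        s2, r = step(s, a)
        best = max(Q[s2]) if s2 != 2 else 0.0   # target assumes greedy next action
        Q[s][a] += alpha * (r + gamma * best - Q[s][a])
        s = s2
```

Unlike SARSA, the update ignores the action actually taken next, so \(Q\) converges towards \(Q^*\) even though the behavior policy keeps exploring.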

SARSA vs. Q-Learning

  • Commonalities:
    • Both use \(\epsilon\)-greedy for exploration.
    • Both use a Q-table for sampling (exploration) and updating.
  • Differences:
    • SARSA (On-Policy): Target is \(R + \gamma Q(S', A')\). Updating based on actual action taken. Safer path (penalizes exploration risks).
    • Q-Learning (Off-Policy): Target is \(R + \gamma \max_{a'} Q(S', a')\). Updating based on best possible action. Finds optimal path regardless of exploration risks.

Value Function Approximation (VFA)

Function Approximation

We can use a function approximator with a parameter vector \(\mathbf{w}\) to estimate the value function. This allows us to generalize from seen states to unseen states, and avoid storing tabular data \(V(s)\) or \(Q(s,a)\).

\[ \begin{aligned} \hat{v}(s, \mathbf{w}) &\approx v_\pi(s)\\ \hat{q}(s, a, \mathbf{w}) &\approx q_\pi(s, a) \end{aligned} \]

Value Function Approximation (VFA) for Prediction

Goal: Find parameter vector \(\mathbf{w}\) that minimizes the Mean Squared Error (MSE) between the approximate value \(\hat{v}(s, \mathbf{w})\) and the true value \(v_\pi(s)\):

\[ J(\mathbf{w}) = \mathbb{E}_\pi \left[ (v_\pi(s) - \hat{v}(s, \mathbf{w}))^2 \right] \]

Gradient Descent: We update \(\mathbf{w}\) in the direction of the negative gradient:

\[ \Delta \mathbf{w} = -\frac{1}{2} \alpha \nabla_\mathbf{w} J(\mathbf{w}) = \alpha (v_\pi(s) - \hat{v}(s, \mathbf{w})) \nabla_\mathbf{w} \hat{v}(s, \mathbf{w}) \]

Substituting the "True Value" (The Target): Since we do not have an Oracle to tell us the true \(v_\pi(s)\), we substitute it with a Target:

  1. Monte Carlo (MC) with VFA: The target is the actual return \(G_t\).

    \[ \Delta \mathbf{w} = \alpha (G_t - \hat{v}(S_t, \mathbf{w})) \nabla \hat{v}(S_t, \mathbf{w}) \]
  2. TD(0) with VFA (Semi-Gradient): The target is the bootstrapping estimate \(R_{t+1} + \gamma \hat{v}(S_{t+1}, \mathbf{w})\).

    \[ \Delta \mathbf{w} = \alpha (R_{t+1} + \gamma \hat{v}(S_{t+1}, \mathbf{w}) - \hat{v}(S_t, \mathbf{w})) \nabla \hat{v}(S_t, \mathbf{w}) \]
    • Note: This is called Semi-Gradient because we treat the target as a constant and ignore the gradient of \(\hat{v}(S_{t+1}, \mathbf{w})\) with respect to \(\mathbf{w}\).
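The semi-gradient TD(0) update can be sketched with a linear approximator \(\hat{v}(s, \mathbf{w}) = \mathbf{w}^\top x(s)\) on a made-up deterministic chain \(0 \to 1 \to 2 \to 3\) (terminal), reward \(+1\) per step, \(\gamma = 1\), so the true values are \([3, 2, 1, 0]\); with one-hot features this reduces to the tabular case:

```python
import numpy as np

gamma, alpha = 1.0, 0.1
w = np.zeros(3)                        # one weight per non-terminal state

def x(s):
    feat = np.zeros(3)                 # one-hot feature vector x(s)
    feat[s] = 1.0
    return feat

def v_hat(s):
    return 0.0 if s == 3 else w @ x(s)

for _ in range(500):
    s = 0
    while s != 3:
        s2, r = s + 1, 1.0
        # semi-gradient: the target is treated as a constant, so the
        # gradient flows only through v_hat(s, w), i.e. through x(s)
        delta = r + gamma * v_hat(s2) - v_hat(s)
        w += alpha * delta * x(s)
        s = s2
```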

VFA for Control

To perform control (finding optimal policy), we approximate the Action-Value Function \(\hat{q}(s,a,\mathbf{w}) \approx q_\pi(s,a)\).

Generalized Policy Iteration:

  • Evaluation: Approximate policy evaluation using \(\hat{q}\).
  • Improvement: \(\epsilon\)-greedy policy improvement.

Update Rules:

  • SARSA (On-Policy):

    \[ \Delta \mathbf{w} = \alpha ({\color{red}R_{t+1} + \gamma \hat{q}(s_{t+1}, a_{t+1}, \mathbf{w})} - \hat{q}(s_t, a_t, \mathbf{w})) \nabla \hat{q}(s_t, a_t, \mathbf{w}) \]
  • Q-Learning (Off-Policy):

    \[ \Delta \mathbf{w} = \alpha ({\color{red}R_{t+1} + \gamma \max_{a'} \hat{q}(s_{t+1}, a', \mathbf{w})} - \hat{q}(s_t, a_t, \mathbf{w})) \nabla \hat{q}(s_t, a_t, \mathbf{w}) \]

The Deadly Triad

  1. Function Approximation: Scalable way to generalize across large state/action spaces.
  2. Bootstrapping: Updating includes existing estimates (e.g., TD methods) rather than actual reward and complete returns (e.g., MC methods).
  3. Off-Policy Training: Training on a distribution of transitions other than that produced by the target policy (e.g., Q-learning).

Deep Q-Networks (DQN)

DQN combined Q-Learning with CNNs to achieve human-level performance on Atari games. To mitigate the instability caused by the "Deadly Triad", it introduced two key mechanisms:

Experience Replay

  • Problem: Samples in RL are sequential and highly correlated (\(S_t\) is very similar to \(S_{t+1}\)).
  • Solution: Store transitions \((s_t, a_t, r_t, s_{t+1})\) in a large Replay Memory \(\mathcal{D}\). During training, randomly sample a mini-batch of transitions to update the network.

Fixed Q-Targets

  • Problem: In standard Q-learning, the target \(y = R + \gamma \max_{a'} \hat{q}(S', a'; \mathbf{w})\) depends on the same network parameters \(\mathbf{w}\) that are being updated.

  • Solution: Use two separate networks:

    1. Main Network (\(\mathbf{w}\)): Evaluates the current state action pair. Updated at every step.
    2. Target Network (\(\mathbf{w}^-\)): Generates the target value. Its parameters are frozen and only copied from the Main Network periodically (e.g., every 1000 steps).
    \[ \text{Target} = r + \gamma \max_{a'} \hat{q}(s', a'; \mathbf{w}^-) \]
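The two mechanisms can be sketched together in a small skeleton class; the `q_net` object below is a stand-in for any copyable Q-function approximator, and the class name and interface are illustrative, not a real library API:

```python
import copy
import random
from collections import deque

class DQNSketch:
    """Sketch of DQN's two stabilizers (not a full training loop)."""

    def __init__(self, q_net, capacity=10000, sync_every=1000):
        self.q = q_net                             # main network, parameters w
        self.q_target = copy.deepcopy(q_net)       # target network, w^-
        self.memory = deque(maxlen=capacity)       # replay memory D
        self.sync_every, self.steps = sync_every, 0

    def store(self, transition):                   # (s, a, r, s', done)
        self.memory.append(transition)

    def sample(self, batch_size):
        # a random mini-batch breaks the temporal correlation between samples
        return random.sample(list(self.memory), batch_size)

    def maybe_sync(self):
        # w^- stays frozen between syncs; copy w -> w^- every sync_every steps
        self.steps += 1
        if self.steps % self.sync_every == 0:
            self.q_target = copy.deepcopy(self.q)

agent = DQNSketch({'w': 0}, capacity=10, sync_every=2)
for i in range(5):
    agent.store((i, 0, 0.0, i + 1, False))
batch = agent.sample(3)
agent.q['w'] = 7
agent.maybe_sync()                    # step 1: target still frozen
frozen = agent.q_target['w']
agent.maybe_sync()                    # step 2: w copied into w^-
synced = agent.q_target['w']
```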

Policy Optimization I

  • Value-Based RL: Learns the value function \(V(s)\) or \(Q(s,a)\) and generates a policy implicitly (e.g., \(\epsilon\)-greedy).
  • Policy-Based RL: Learns the policy \(\pi_{\theta}(a|s)\) directly without learning a value function.
  • Actor-Critic: Learns both the policy (Actor) and the value function (Critic).

Policy Objective Functions

The goal is to find parameters \(\theta\) that maximize the expected return \(J(\theta)\).

  • Episodic Environments: Maximize start value.

    \[J_1(\theta) = V^{\pi_\theta}(s_1) = \mathbb{E}_{\pi_\theta}[v_1]\]
  • Continuing Environments:

    • Average Value: Weighted average of state values based on stationary distribution \(d^{\pi_\theta}(s)\).

      \[J_{avV}(\theta) = \sum_s d^{\pi_\theta}(s) V^{\pi_\theta}(s)\]
    • Average Reward: Average immediate reward per time-step.

      \[J_{avR}(\theta) = \sum_s d^{\pi_\theta}(s) \sum_a \pi_\theta(s,a) R(s,a)\]

Policy Gradient

We use the Log-Likelihood Ratio Trick to calculate the gradient of the expectation:

\[\nabla_\theta \pi_\theta(s,a) = \pi_\theta(s,a)\frac{\nabla_\theta \pi_\theta(s,a)}{\pi_\theta(s,a)} = \pi_\theta(s,a) \nabla_\theta \log \pi_\theta(s,a)\]

The gradient of the objective function becomes:

\[ \nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta} \left[ R(\tau) \nabla_\theta \log P(\tau; \theta) \right] \]

This simplifies to summing gradients over time steps:

\[ \nabla_\theta J(\theta) \approx \frac{1}{m} \sum_{i=1}^m R(\tau_i) \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t^i | s_t^i) \]

Policy Gradient for MDPs

Policy Gradient for One-Step MDPs

Consider a simplified environment where the agent starts in state \(s \sim d(s)\), takes one action \(a\), receives reward \(r = R(s,a)\), and the episode terminates.

  • Objective Function:

    \[J(\theta) = \mathbb{E}_{\pi_\theta}[r] = \sum_{s \in \mathcal{S}} d(s) \sum_{a \in \mathcal{A}} \pi_\theta(s,a) R(s,a)\]
  • Gradient Derivation: We take the gradient with respect to \(\theta\):

    \[ \begin{aligned} \nabla_\theta J(\theta) &= \sum_{s \in \mathcal{S}} d(s) \sum_{a \in \mathcal{A}} \nabla_\theta \pi_\theta(s,a) R(s,a) \\ &= \sum_{s \in \mathcal{S}} d(s) \sum_{a \in \mathcal{A}} \pi_\theta(s,a) \frac{\nabla_\theta \pi_\theta(s,a)}{\pi_\theta(s,a)} R(s,a) \\ &= \sum_{s \in \mathcal{S}} d(s) \sum_{a \in \mathcal{A}} \pi_\theta(s,a) \nabla_\theta \log \pi_\theta(s,a) R(s,a) \\ &= \mathbb{E}_{\pi_\theta} [r \nabla_\theta \log \pi_\theta(s,a)] \end{aligned} \]

Policy Gradient for Multi-Step MDPs

We consider a trajectory \(\tau = (s_0, a_0, r_1, ..., s_{T-1}, a_{T-1}, r_T)\). The total reward for a trajectory is \(R(\tau) = \sum_{t=0}^{T-1} R(s_t, a_t)\).

  • Objective Function:

    \[J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}[R(\tau)] = \sum_\tau P(\tau; \theta) R(\tau)\]

    where \(P(\tau; \theta)\) is the probability of the trajectory occurring under policy \(\pi_\theta\).

  • Gradient Derivation:

    1. Apply Log-Likelihood Ratio Trick:

      \[ \begin{aligned} \nabla_\theta J(\theta) &= \nabla_\theta \sum_\tau P(\tau; \theta) R(\tau) \\ &= \sum_\tau P(\tau; \theta) \nabla_\theta \log P(\tau; \theta) R(\tau) \\ &= \mathbb{E}_{\tau \sim \pi_\theta} [\nabla_\theta \log P(\tau; \theta) R(\tau)] \end{aligned} \]
    2. Decompose the Trajectory Probability: The probability of a trajectory is the product of initial state prob, policy probs, and transition probs:

      \[P(\tau; \theta) = \mu(s_0) \prod_{t=0}^{T-1} \pi_\theta(a_t|s_t) P(s_{t+1}|s_t, a_t)\]

      Taking the log decomposes the product into a sum:

      \[\log P(\tau; \theta) = \log \mu(s_0) + \sum_{t=0}^{T-1} \log \pi_\theta(a_t|s_t) + \sum_{t=0}^{T-1} \log P(s_{t+1}|s_t, a_t)\]
    3. Calculate the Gradient: Since the environment dynamics terms (\(\log \mu(s_0)\) and \(\log P(s_{t+1}|s_t, a_t)\)) do not depend on \(\theta\), their gradients are zero. Thus:

      \[ \nabla_\theta \log P(\tau; \theta) = \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t|s_t) \]
  • Final Update Rule: We approximate the expectation by sampling \(m\) trajectories:

    \[ \nabla_\theta J(\theta) \approx \frac{1}{m} \sum_{i=1}^m R(\tau_i) \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t^i | s_t^i) \]
    • Key Insight: The gradient depends only on the policy gradients. This allows us to maximize the expected reward without knowing the dynamics model \(P(s'|s,a)\) (Model-Free).

REINFORCE Algorithm

REINFORCE is a Monte-Carlo Policy Gradient algorithm:

  1. Sample episode trajectories following \(\pi_\theta\).
  2. Update \(\theta\) using the estimated gradient:

\[\theta \leftarrow \theta + \alpha \gamma^t G_t \nabla_\theta \ln \pi(A_t | S_t, \theta)\]

where \(G_t\) is the return from step \(t\).
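A minimal sketch of REINFORCE on a made-up one-step task (a 2-armed bandit): a softmax policy over two logits \(\theta\), arm 1 always pays \(+1\), arm 0 pays 0, and \(\gamma^t = 1\) since every episode has a single step.

```python
import math, random

random.seed(0)

theta, alpha = [0.0, 0.0], 0.1

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

for _ in range(2000):
    probs = softmax(theta)
    a = 0 if random.random() < probs[0] else 1   # sample A_t ~ pi_theta
    G = 1.0 if a == 1 else 0.0                   # one-step return G_t = r
    # softmax identity: d/d theta_k log pi(a) = 1{k = a} - pi(k)
    for k in range(2):
        theta[k] += alpha * G * ((1.0 if k == a else 0.0) - probs[k])

probs_final = softmax(theta)
```

After training, the policy puts almost all of its probability mass on the rewarding arm.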

Variance Reduction

The raw Policy Gradient has high variance. We use two main techniques to reduce it:

A. Causality (Temporal Structure)

Rewards from the past cannot be affected by actions in the future. Instead of using the full trajectory return \(R(\tau)\), we use the return-to-go \(G_t\):

\[ \nabla_\theta \mathbb{E}[R] \approx \frac{1}{m} \sum_{i=1}^m \sum_{t=0}^{T-1} G_t^{(i)} \cdot \nabla_\theta \log \pi_\theta(a_t^i | s_t^i) \]

B. Baselines

We can subtract a baseline \(b(s)\) from the return. This does not change the expected gradient (unbiased) but significantly reduces variance.

\[ \nabla_\theta J(\theta) = \mathbb{E} \left[ \sum_{t=0}^{T-1} (G_t - b(s_t)) \nabla_\theta \log \pi_\theta(a_t | s_t) \right] \]
  • A good baseline is the state value function: \(b(s_t) \approx V(s_t)\).
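To see why the baseline term contributes zero in expectation, use the log-likelihood identity and the normalization \(\sum_a \pi_\theta(a|s) = 1\):

\[ \mathbb{E}_{\pi_\theta}\left[ b(s)\, \nabla_\theta \log \pi_\theta(a|s) \right] = \sum_{s} d(s)\, b(s) \sum_{a} \pi_\theta(a|s)\, \nabla_\theta \log \pi_\theta(a|s) = \sum_{s} d(s)\, b(s)\, \nabla_\theta \underbrace{\sum_{a} \pi_\theta(a|s)}_{=\,1} = 0 \]

Since the gradient of a constant is zero, subtracting \(b(s)\) leaves the gradient estimate unbiased while shrinking its magnitude, and hence its variance.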

Actor-Critic

To further reduce variance (at the cost of some bias), we replace the noisy MC return \(G_t\) with a value function estimate (Critic).

  • Actor: The policy \(\pi_\theta(a|s)\) (updates \(\theta\)).
  • Critic: The value function \(Q_w(s,a)\) or \(V_w(s)\) (updates \(w\)).

Advantage Actor-Critic: Using the Advantage Function \(A(s,a)\) combines the Critic with a Baseline:

\[ A^{\pi}(s,a) = Q^{\pi}(s,a) - V^{\pi}(s) \]

The gradient becomes:

\[ \nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta} [\nabla_\theta \log \pi_\theta(s,a) A^{\pi}(s,a)] \]

N-step Estimators: We can blend TD and MC methods using N-step returns to trade off bias and variance:

  • 1-step (TD): Low variance, High bias.
  • \(\infty\)-step (MC): High variance, Low bias.
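A one-step advantage actor-critic can be sketched on a made-up 2-armed bandit (arm 1 pays \(+1\), arm 0 pays 0): the critic is a single state value \(V\), and the advantage estimate is \(A = r - V\), i.e. the baseline-corrected one-step return.

```python
import math, random

random.seed(0)

theta, V = [0.0, 0.0], 0.0
alpha_actor, alpha_critic = 0.1, 0.1

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    return [v / sum(e) for v in e]

for _ in range(2000):
    probs = softmax(theta)
    a = 0 if random.random() < probs[0] else 1
    r = 1.0 if a == 1 else 0.0
    adv = r - V                                   # advantage A(s, a) = r - V(s)
    for k in range(2):                            # actor: policy-gradient step
        theta[k] += alpha_actor * adv * ((1.0 if k == a else 0.0) - probs[k])
    V += alpha_critic * (r - V)                   # critic: move V towards r

probs_final = softmax(theta)
```

Unlike plain REINFORCE, every sampled action updates the policy (a zero-reward arm yields a negative advantage), which is one source of the variance reduction.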