Reinforcement Learning
Markov Decision Process (MDP)
Markov Process (MP)
Given a sequence of states \(h_t = \{s_1, s_2, ..., s_t\}\), the Markov Process assumes that the next state \(s_{t+1}\) depends only on the current state \(s_t\), i.e.,
\[P(s_{t+1} | s_t) = P(s_{t+1} | h_t)\]
Markov Reward Process (MRP)
MP + Rewards
- Tuple: \(\langle S, P, R, \gamma \rangle\)
- \(S\) (State): A (finite) set of states (\(s \in S\)).
- \(P\) (Transition Probability): \(P(s_{t+1}=s'|s_t=s)\).
- \(R\) (Reward): Expectation of the immediate reward: \(R(s_t=s) = \mathbb{E}[r_t | s_t=s]\).
- \(\gamma\) (Discount factor): \(\gamma \in [0, 1]\); avoids infinite returns in cyclic processes.
- Return (\(G_t\)): Total discounted reward from time \(t\):
\[G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + ... + \gamma^{T-t-1} R_T\]
Value Function \(V(s)\) and Bellman Equation
The expected return starting from state \(s\) is defined as the value function:
\[V(s) = \mathbb{E}[G_t \mid s_t = s]\]
The MRP value function satisfies the Bellman Equation:
\[V(s) = R(s) + \gamma \sum_{s' \in S} P(s'|s) V(s')\]
We can express \(V(s)\) in matrix form:
\[V = R + \gamma P V\]
and obtain the analytical solution of the Bellman Equation:
\[V = (I - \gamma P)^{-1} R\]
where \(I\) is the identity matrix. This matrix inversion is \(O(N^3)\) for N states.
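As a concrete check, the closed-form solution is a one-line linear solve in numpy. The 2-state chain below is a minimal sketch; the transition matrix and rewards are made-up illustrative numbers:

```python
import numpy as np

# Hypothetical 2-state MRP; P and R below are made-up illustrative numbers.
P = np.array([[0.5, 0.5],   # row s, column s': transition probabilities
              [0.0, 1.0]])  # state 1 is absorbing
R = np.array([1.0, 0.0])    # expected immediate reward per state
gamma = 0.9

# Analytical solution of V = R + gamma * P @ V, i.e. V = (I - gamma P)^{-1} R.
# Solving the linear system costs O(N^3) for N states.
V = np.linalg.solve(np.eye(len(R)) - gamma * P, R)
print(V)  # value of the absorbing state is 0
```

Using `linalg.solve` rather than forming the inverse explicitly is the standard numerically safer choice.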
Markov Decision Process (MDP)
Now we add a finite set of actions \(A\):
- Tuple: \(\langle S, {\color{red}A}, P, R, \gamma \rangle\)
- Policy (\(\pi\)): The agent's behavior function:
\[\pi(a|s) = P(a_t = a | s_t = s)\]
Given a policy \(\pi\), the MDP reduces to an MRP \(\langle S, P^\pi, R^\pi, \gamma \rangle\) where:
\[P^\pi(s'|s) = \sum_{a \in A} \pi(a|s) P(s'|s,a), \quad R^\pi(s) = \sum_{a \in A} \pi(a|s) R(s,a)\]
Value Functions: \(V\) and \(Q\)
- State-value function \(v^\pi(s)\): Expected return starting from \(s\) and following policy \(\pi\):
\[v^\pi(s) = \mathbb{E}_\pi[G_t \mid s_t = s]\]
- Action-value function \(q^\pi(s, a)\): Expected return starting from \(s\), taking action \(a\), and then following policy \(\pi\):
\[q^\pi(s, a) = \mathbb{E}_\pi[G_t \mid s_t = s, a_t = a]\]
and we have the relationship:
\[v^\pi(s) = \sum_{a \in A} \pi(a|s) q^\pi(s, a)\]
Thus, we have the Bellman Expectation Equation:
\[v^\pi(s) = \sum_a \pi(a|s) \left( R(s,a) + \gamma \sum_{s'} P(s'|s,a) v^\pi(s') \right), \quad q^\pi(s,a) = R(s,a) + \gamma \sum_{s'} P(s'|s,a) v^\pi(s')\]
Solving MDPs: Prediction and Control
Prediction (Policy Evaluation)
Goal: Find value function \(v_\pi\) given policy \(\pi\).
- Method: Iterative Policy Evaluation.
- Update Rule (Bellman Expectation Backup):
\[v_{k+1}(s) \leftarrow \sum_a \pi(a|s) \left( R(s,a) + \gamma \sum_{s'} P(s'|s,a) v_k(s') \right)\]
until convergence.
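The backup above can be sketched in a few lines of numpy. The tiny 2-state, 2-action MDP below evaluates a uniform random policy; all transition and reward numbers are invented for illustration:

```python
import numpy as np

# Invented 2-state, 2-action MDP for illustration.
n_s, n_a = 2, 2
P = np.zeros((n_s, n_a, n_s))       # P[s, a, s']
P[0, 0] = [1.0, 0.0]                # action 0 stays in s0
P[0, 1] = [0.0, 1.0]                # action 1 moves to s1
P[1, :] = [0.0, 1.0]                # s1 is absorbing
R = np.array([[0.0, 1.0],           # R[s, a]
              [0.0, 0.0]])
pi = np.full((n_s, n_a), 0.5)       # uniform random policy pi(a|s)
gamma = 0.9

v = np.zeros(n_s)
for _ in range(1000):               # iterate the Bellman expectation backup
    q = R + gamma * np.einsum('sap,p->sa', P, v)   # q_k(s, a)
    v_new = np.einsum('sa,sa->s', pi, q)           # average over pi(a|s)
    if np.max(np.abs(v_new - v)) < 1e-10:
        break
    v = v_new
print(v)
```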
Control (Finding Optimal Policy)
Goal: Find \(\pi^*\) that maximizes rewards.
Two primary DP algorithms:
- Policy Iteration:
  - Cycle: Evaluation (\(v = v_\pi\)) \(\to\) Improvement (\(\pi' = \text{greedy}(v)\)).
  - Compute state-action value of a policy \(\pi\):
\[q^{\pi_i}(s, a) = R(s,a) + \gamma \sum_{s'\in S} P(s'|s,a) v^{\pi_i}(s')\]
  - Compute new policy \(\pi_{i+1}\) for all \(s\in S\):
\[\pi_{i+1}(s) = \arg\max_a q^{\pi_i}(s, a)\]
  - Monotonic improvement, meaning the optimal policy \(\pi^*\) is guaranteed to be reached.
- Value Iteration:
  - Find the optimal value function iteratively with the Bellman Optimality backup, then retrieve the policy from the optimal value function.
  - Update the q function:
\[q_{k+1}(s, a) = R(s,a) + \gamma \sum_{s'} P(s'|s,a) v_k(s')\]
  - Find the best value function:
\[v_{k+1}(s) = \max_a q_{k+1}(s, a)\]
  - Retrieve the optimal policy:
\[\pi^*(s) = \arg\max_a q^*(s, a)\]
Policy Iteration vs. Value Iteration
- Policy iteration: iteration on policy evaluation & policy improvement until policy converges.
- Value iteration: iteration on value function (find optimal value), then extract policy once.
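The value-iteration loop is short enough to sketch directly. The toy MDP below uses the same illustrative shape (`P[s, a, s']` and `R[s, a]` are made-up numbers); the policy is extracted once, at the end:

```python
import numpy as np

# Invented 2-state, 2-action MDP for illustration.
n_s, n_a = 2, 2
P = np.zeros((n_s, n_a, n_s))       # P[s, a, s']
P[0, 0] = [1.0, 0.0]                # action 0 stays in s0
P[0, 1] = [0.0, 1.0]                # action 1 moves to s1 (reward 1)
P[1, :] = [0.0, 1.0]                # s1 is absorbing
R = np.array([[0.0, 1.0],
              [0.0, 0.0]])
gamma = 0.9

v = np.zeros(n_s)
for _ in range(1000):
    q = R + gamma * np.einsum('sap,p->sa', P, v)   # Bellman optimality backup
    v_new = q.max(axis=1)                          # v_{k+1}(s) = max_a q(s, a)
    if np.max(np.abs(v_new - v)) < 1e-10:
        break
    v = v_new

pi_star = q.argmax(axis=1)   # extract the optimal policy once
print(v, pi_star)
```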
Model-Free Prediction and Control
In practice, the MDP might be unknown (\({\color{red}R}\) and \({\color{red}P}\)): the full tables of \(R(s,a)\) and \(P(s'|s,a)\) are not available.
Model-Free Prediction (Policy Evaluation)
Goal: Estimate \(v_\pi(s)\) given episodes of experience under policy \(\pi\).
1. Monte Carlo (MC) Learning
MC methods learn directly from complete episodes of experience (episodic tasks, i.e., ones that must terminate):
\[V(S_t) \leftarrow V(S_t) + \alpha (G_t - V(S_t))\]
where \(G_t\) is the actual total discounted return and \(\alpha\) is the step size (learning rate).
2. Temporal Difference (TD) Learning
TD methods learn from incomplete episodes by bootstrapping (updating a guess towards a guess), i.e., update \(V(S_t)\) towards the estimated return after one step (\(R_{t+1} + \gamma V(S_{t+1})\)).
- TD(0) Update Rule:
\[V(S_t) \leftarrow V(S_t) + \alpha (R_{t+1} + \gamma V(S_{t+1}) - V(S_t))\]
- TD Error (\(\delta_t\)): \(\delta_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t)\)
MC vs. TD
- Bias/Variance:
- MC: Zero bias (unbiased estimate of \(v_\pi\)), High variance (depends on many random actions/transitions).
- TD: Some bias (bootstraps on current estimates), Lower variance (depends on only one random action/transition).
- Efficiency: TD usually converges faster and does not wait for the end of the episode.
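A minimal TD(0) prediction sketch, assuming a hypothetical deterministic two-state chain \(s_0 \to s_1 \to\) terminal, with reward 1 on the final transition and \(\gamma = 1\). Replaying the episode drives \(V\) towards \(v_\pi\) (here 1 for both states):

```python
# TD(0) on an invented deterministic chain: s0 -r=0-> s1 -r=1-> terminal.
gamma, alpha = 1.0, 0.1
V = {0: 0.0, 1: 0.0, 'T': 0.0}            # terminal state 'T' has value 0
episode = [(0, 0.0, 1), (1, 1.0, 'T')]    # (s, r, s') transitions

for _ in range(500):                      # replay the same episode many times
    for s, r, s_next in episode:
        td_error = r + gamma * V[s_next] - V[s]   # delta_t
        V[s] += alpha * td_error          # update a guess towards a guess
print(V[0], V[1])
```

Note the update happens after every single transition; an MC learner would instead wait for the episode to end and regress each \(V(S_t)\) towards the full return \(G_t\).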
Model-Free Control
Goal: Find optimal policy \(\pi^*\) without a model. Without \(P(s'|s,a)\), we cannot use \(V(s)\) to look ahead and choose actions. We must learn Action-Value Function \(Q(s,a)\).
- Exploration: To ensure all pairs \((s,a)\) are visited, we use \(\epsilon\)-greedy exploration:
\[ \pi(a|s) = \begin{cases} \underbrace{1 - \epsilon}_{\text{greedy prob}} + \underbrace{\epsilon / |A|}_{\text{random prob}} & \text{if } a = \arg\max_{a'} Q(s,a') \\ \epsilon / |A| & \text{otherwise} \end{cases} \]
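The case analysis above maps directly onto a probability vector; a minimal sketch (the helper name `epsilon_greedy_probs` is mine, not from any library):

```python
import numpy as np

def epsilon_greedy_probs(q_values, epsilon):
    """Action probabilities for one state under epsilon-greedy exploration."""
    n = len(q_values)
    probs = np.full(n, epsilon / n)            # every action gets eps/|A|
    probs[np.argmax(q_values)] += 1 - epsilon  # greedy action gets the rest
    return probs

p = epsilon_greedy_probs(np.array([0.1, 0.5, 0.2]), epsilon=0.3)
print(p)  # greedy action (index 1): (1 - 0.3) + 0.3/3 = 0.8
```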
1. SARSA (On-Policy Control)
State-Action-Reward-State-Action. The agent learns the value of the policy being carried out (including the exploration steps).
- Algorithm:
- At state \(S\), choose action \(A\) using \(\epsilon\)-greedy derived from \(Q\).
- Take action \(A\), observe \(R, S'\).
- Choose next action \(A'\) using \(\epsilon\)-greedy derived from \(Q\) (at \(S'\)).
- Update:
\[Q(S, A) \leftarrow Q(S, A) + \alpha [R + \gamma Q(S', A') - Q(S, A)]\]
- \(S \leftarrow S', A \leftarrow A'\).
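The loop above can be sketched on a hypothetical one-state chain: action 1 terminates the episode with reward 1, action 0 stays put with reward 0. All environment details are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Invented chain: only state 0 is non-terminal.
n_a = 2
Q = np.zeros((1, n_a))
gamma, alpha, eps = 0.9, 0.5, 0.1

def eps_greedy(s):
    # epsilon-greedy action from the current Q table
    if rng.random() < eps:
        return int(rng.integers(n_a))
    return int(np.argmax(Q[s]))

for _ in range(200):                 # episodes
    s, a = 0, eps_greedy(0)
    while True:
        if a == 1:                   # terminal transition: Q(terminal, .) = 0
            Q[s, a] += alpha * (1.0 - Q[s, a])
            break
        s2, r = 0, 0.0               # action 0: stay in s0 with reward 0
        a2 = eps_greedy(s2)          # SARSA: commit to the next action first
        Q[s, a] += alpha * (r + gamma * Q[s2, a2] - Q[s, a])
        s, a = s2, a2
print(Q[0])
```

The defining on-policy detail is that \(A'\) is sampled before the update, and the update bootstraps on that actual action, exploration included.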
2. Q-Learning (Off-Policy Control)
The agent learns the value of the optimal policy while following a behavior policy (e.g., \(\epsilon\)-greedy).
- Algorithm:
- At state \(S\), choose action \(A\) using \(\epsilon\)-greedy derived from \(Q\).
- Take action \(A\), observe \(R, S'\).
- Update (assume the next action is greedy):
\[Q(S, A) \leftarrow Q(S, A) + \alpha [R + \gamma \max_{a'} Q(S', a') - Q(S, A)]\]
- \(S \leftarrow S'\).
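A matching Q-learning sketch on the same invented chain; the target uses \(\max_{a'} Q(S', a')\) while the behavior policy stays \(\epsilon\)-greedy:

```python
import numpy as np

rng = np.random.default_rng(0)

# Same invented chain: action 1 terminates with reward 1, action 0 stays.
n_a = 2
Q = np.zeros((1, n_a))
gamma, alpha, eps = 0.9, 0.5, 0.1

for _ in range(200):                 # episodes
    s = 0
    while True:
        # behavior policy: epsilon-greedy
        if rng.random() < eps:
            a = int(rng.integers(n_a))
        else:
            a = int(np.argmax(Q[s]))
        if a == 1:                   # terminal transition: no bootstrap term
            Q[s, a] += alpha * (1.0 - Q[s, a])
            break
        s2, r = 0, 0.0
        # target assumes the next action is greedy, not the one actually taken
        Q[s, a] += alpha * (r + gamma * np.max(Q[s2]) - Q[s, a])
        s = s2
print(Q[0])
```

Compared with the SARSA sketch, no \(A'\) is sampled for the update: the `max` plays the role of the greedy target policy, which is what makes this off-policy.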
SARSA vs. Q-Learning
- Commonalities:
- Both use \(\epsilon\)-greedy for exploration.
- Both use a Q-table for sampling (exploration) and updating.
- Differences:
- SARSA (On-Policy): Target is \(R + \gamma Q(S', A')\). Updates based on the actual action taken, so exploration risks are penalized and it tends to find safer paths.
- Q-Learning (Off-Policy): Target is \(R + \gamma \max_{a'} Q(S', a')\). Updates based on the best possible next action, so it finds the optimal path regardless of exploration risks.
Value Function Approximation (VFA)
Function Approximation
We can use a function approximator with a parameter vector \(\mathbf{w}\) to estimate the value function. This allows us to generalize from seen states to unseen states, and avoid storing tabular data \(V(s)\) or \(Q(s,a)\).
Value Function Approximation (VFA) for Prediction
Goal: Find parameter vector \(\mathbf{w}\) that minimizes the Mean Squared Error (MSE) between the approximate value \(\hat{v}(s, \mathbf{w})\) and the true value \(v_\pi(s)\):
\[J(\mathbf{w}) = \mathbb{E}_\pi \left[ (v_\pi(s) - \hat{v}(s, \mathbf{w}))^2 \right]\]
Gradient Descent: We update \(\mathbf{w}\) in the direction of the negative gradient:
\[\Delta \mathbf{w} = \alpha (v_\pi(s) - \hat{v}(s, \mathbf{w})) \nabla_\mathbf{w} \hat{v}(s, \mathbf{w})\]
Substituting the "True Value" (The Target): Since we do not have an Oracle to tell us the true \(v_\pi(s)\), we substitute it with a Target:
- Monte Carlo (MC) with VFA: The target is the actual return \(G_t\).
\[ \Delta \mathbf{w} = \alpha (G_t - \hat{v}(S_t, \mathbf{w})) \nabla \hat{v}(S_t, \mathbf{w}) \]
- TD(0) with VFA (Semi-Gradient): The target is the bootstrapping estimate \(R_{t+1} + \gamma \hat{v}(S_{t+1}, \mathbf{w})\).
\[ \Delta \mathbf{w} = \alpha (R_{t+1} + \gamma \hat{v}(S_{t+1}, \mathbf{w}) - \hat{v}(S_t, \mathbf{w})) \nabla \hat{v}(S_t, \mathbf{w}) \]
- Note: This is called Semi-Gradient because we treat the target as a constant and ignore the gradient of \(\hat{v}(S_{t+1}, \mathbf{w})\) with respect to \(\mathbf{w}\).
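With one-hot features, the semi-gradient TD(0) update reduces exactly to the tabular rule, which makes it easy to sanity-check. The chain below is an invented example (\(s_0 \to s_1 \to\) terminal, reward 1 on the last step, \(\gamma = 1\)):

```python
import numpy as np

# Linear approximator v_hat(s, w) = w @ x(s) with one-hot features.
x = {0: np.array([1.0, 0.0]),
     1: np.array([0.0, 1.0]),
     'T': np.array([0.0, 0.0])}     # terminal state has value 0
w = np.zeros(2)
gamma, alpha = 1.0, 0.1
episode = [(0, 0.0, 1), (1, 1.0, 'T')]   # (s, r, s') transitions

for _ in range(500):
    for s, r, s2 in episode:
        target = r + gamma * (w @ x[s2])         # treated as a constant
        w += alpha * (target - w @ x[s]) * x[s]  # grad_w v_hat(s, w) = x(s)
print(w)  # both components approach v_pi = 1
```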
VFA for Control
To perform control (finding optimal policy), we approximate the Action-Value Function \(\hat{q}(s,a,\mathbf{w}) \approx q_\pi(s,a)\).
Generalized Policy Iteration:
- Evaluation: Approximate policy evaluation using \(\hat{q}\).
- Improvement: \(\epsilon\)-greedy policy improvement.
Update Rules:
- SARSA (On-Policy):
\[ \Delta \mathbf{w} = \alpha ({\color{red}R_{t+1} + \gamma \hat{q}(s_{t+1}, a_{t+1}, \mathbf{w})} - \hat{q}(s_t, a_t, \mathbf{w})) \nabla \hat{q}(s_t, a_t, \mathbf{w}) \]
- Q-Learning (Off-Policy):
\[ \Delta \mathbf{w} = \alpha ({\color{red}R_{t+1} + \gamma \max_{a'} \hat{q}(s_{t+1}, a', \mathbf{w})} - \hat{q}(s_t, a_t, \mathbf{w})) \nabla \hat{q}(s_t, a_t, \mathbf{w}) \]
The Deadly Triad
- Function Approximation: Scalable way to generalize across large state/action spaces.
- Bootstrapping: Updates include existing estimates (e.g., TD methods) rather than actual rewards and complete returns (e.g., MC methods).
- Off-Policy Training: Training on a distribution of transitions other than that produced by the target policy (e.g., Q-learning).
Deep Q-Networks (DQN)
DQN combined Q-Learning with CNNs to achieve human-level performance on Atari games. To address the instability caused by the Deadly Triad, it introduced two key mechanisms:
Experience Replay
- Problem: Samples in RL are sequential and highly correlated (\(S_t\) is very similar to \(S_{t+1}\)).
- Solution: Store transitions \((s_t, a_t, r_t, s_{t+1})\) in a large Replay Memory \(\mathcal{D}\). During training, randomly sample a mini-batch of transitions to update the network.
Fixed Q-Targets
- Problem: In standard Q-learning, the target \(y = R + \gamma \max_{a'} \hat{q}(S', a'; \mathbf{w})\) depends on the same network parameters \(\mathbf{w}\) that are being updated.
- Solution: Use two separate networks:
- Main Network (\(\mathbf{w}\)): Evaluates the current state action pair. Updated at every step.
- Target Network (\(\mathbf{w}^-\)): Generates the target value. Its parameters are frozen and only copied from the Main Network periodically (e.g., every 1000 steps).
\[ y = r + \gamma \max_{a'} \hat{q}(s', a'; \mathbf{w}^-) \]
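The two-network bookkeeping can be sketched with a linear "network" standing in for the CNN. Everything here (shapes, replay contents, hyperparameters, update period) is an illustrative assumption, not the DQN paper's settings:

```python
import numpy as np

rng = np.random.default_rng(0)

# Linear stand-in for the Q-network: q(s, .; W) = W @ x(s).
n_features, n_actions = 4, 2
W = rng.normal(size=(n_actions, n_features))  # main network: updated each step
W_target = W.copy()                           # target network: frozen copy

gamma, alpha = 0.99, 0.01
# Replay memory of random (s, a, r, s') transitions (made-up data).
replay = [(rng.normal(size=n_features), int(rng.integers(n_actions)),
           float(rng.normal()), rng.normal(size=n_features))
          for _ in range(100)]

for step in range(1000):
    s, a, r, s2 = replay[int(rng.integers(len(replay)))]  # uniform sample
    y = r + gamma * np.max(W_target @ s2)   # target computed with frozen w^-
    td_error = y - (W @ s)[a]
    W[a] += alpha * td_error * s            # semi-gradient step on main net
    if step % 100 == 0:
        W_target = W.copy()                 # periodic hard copy w^- <- w
print(W)
```

The two ideas from the section show up as the random draw from `replay` (breaking temporal correlation) and the frozen `W_target` (keeping the regression target stable between copies).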
Policy Optimization I
- Value-Based RL: Learns the value function \(V(s)\) or \(Q(s,a)\) and generates a policy implicitly (e.g., \(\epsilon\)-greedy).
- Policy-Based RL: Learns the policy \(\pi_{\theta}(a|s)\) directly without learning a value function.
- Actor-Critic: Learns both the policy (Actor) and the value function (Critic).
Policy Objective Functions
The goal is to find parameters \(\theta\) that maximize the expected return \(J(\theta)\).
- Episodic Environments: Maximize the start value.
\[J_1(\theta) = V^{\pi_\theta}(s_1) = \mathbb{E}_{\pi_\theta}[v_1]\]
- Continuing Environments:
  - Average Value: Weighted average of state values based on the stationary distribution \(d^{\pi_\theta}(s)\).
\[J_{avV}(\theta) = \sum_s d^{\pi_\theta}(s) V^{\pi_\theta}(s)\]
  - Average Reward: Average immediate reward per time-step.
\[J_{avR}(\theta) = \sum_s d^{\pi_\theta}(s) \sum_a \pi_\theta(s,a) R(s,a)\]
Policy Gradient
We use the Log-Likelihood Ratio Trick to calculate the gradient of the expectation:
\[\nabla_\theta \pi_\theta(s,a) = \pi_\theta(s,a) \frac{\nabla_\theta \pi_\theta(s,a)}{\pi_\theta(s,a)} = \pi_\theta(s,a) \nabla_\theta \log \pi_\theta(s,a)\]
The gradient of the objective function becomes:
\[\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}[R(\tau) \nabla_\theta \log P(\tau; \theta)]\]
This simplifies to summing policy gradients over time steps:
\[\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[ R(\tau) \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t | s_t) \right]\]
Policy Gradient for MDPs
Policy Gradient for One-Step MDPs
Consider a simplified environment where the agent starts in state \(s \sim d(s)\), takes one action \(a\), receives reward \(r = R(s,a)\), and the episode terminates.
- Objective Function:
\[J(\theta) = \mathbb{E}_{\pi_\theta}[r] = \sum_{s \in \mathcal{S}} d(s) \sum_{a \in \mathcal{A}} \pi_\theta(s,a) R(s,a)\]
- Gradient Derivation: We take the gradient with respect to \(\theta\):
\[ \begin{aligned} \nabla_\theta J(\theta) &= \sum_{s \in \mathcal{S}} d(s) \sum_{a \in \mathcal{A}} \nabla_\theta \pi_\theta(s,a) R(s,a) \\ &= \sum_{s \in \mathcal{S}} d(s) \sum_{a \in \mathcal{A}} \pi_\theta(s,a) \frac{\nabla_\theta \pi_\theta(s,a)}{\pi_\theta(s,a)} R(s,a) \\ &= \sum_{s \in \mathcal{S}} d(s) \sum_{a \in \mathcal{A}} \pi_\theta(s,a) \nabla_\theta \log \pi_\theta(s,a) R(s,a) \\ &= \mathbb{E}_{\pi_\theta} [r \nabla_\theta \log \pi_\theta(s,a)] \end{aligned} \]
Policy Gradient for Multi-Step MDPs
We consider a trajectory \(\tau = (s_0, a_0, r_1, ..., s_{T-1}, a_{T-1}, r_T)\). The total reward for a trajectory is \(R(\tau) = \sum_{t=0}^{T-1} R(s_t, a_t)\).
- Objective Function:
\[J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}[R(\tau)] = \sum_\tau P(\tau; \theta) R(\tau)\]
where \(P(\tau; \theta)\) is the probability of the trajectory occurring under policy \(\pi_\theta\).
- Gradient Derivation:
  - Apply the Log-Likelihood Ratio Trick:
\[ \begin{aligned} \nabla_\theta J(\theta) &= \nabla_\theta \sum_\tau P(\tau; \theta) R(\tau) \\ &= \sum_\tau P(\tau; \theta) \nabla_\theta \log P(\tau; \theta) R(\tau) \\ &= \mathbb{E}_{\tau \sim \pi_\theta} [\nabla_\theta \log P(\tau; \theta) R(\tau)] \end{aligned} \]
  - Decompose the Trajectory Probability: The probability of a trajectory is the product of the initial state probability, the policy probabilities, and the transition probabilities:
\[P(\tau; \theta) = \mu(s_0) \prod_{t=0}^{T-1} \pi_\theta(a_t|s_t) P(s_{t+1}|s_t, a_t)\]
Taking the log decomposes the product into a sum:
\[\log P(\tau; \theta) = \log \mu(s_0) + \sum_{t=0}^{T-1} \log \pi_\theta(a_t|s_t) + \sum_{t=0}^{T-1} \log P(s_{t+1}|s_t, a_t)\]
  - Calculate the Gradient: Since the environment dynamics terms (\(\log \mu(s_0)\) and \(\log P(s_{t+1}|s_t, a_t)\)) do not depend on \(\theta\), their gradients are zero. Thus:
\[ \nabla_\theta \log P(\tau; \theta) = \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t|s_t) \]
- Final Update Rule: We approximate the expectation by sampling \(m\) trajectories:
\[ \nabla_\theta J(\theta) \approx \frac{1}{m} \sum_{i=1}^m R(\tau_i) \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t^i | s_t^i) \]
- Key Insight: The gradient depends only on the policy gradients. This allows us to maximize the expected reward without knowing the dynamics model \(P(s'|s,a)\) (Model-Free).
REINFORCE Algorithm
REINFORCE is a Monte-Carlo Policy Gradient algorithm:
1. Sample episode trajectories following \(\pi_\theta\).
2. Update \(\theta\) using the estimated gradient:
\[\theta \leftarrow \theta + \alpha \gamma^t G_t \nabla_\theta \ln \pi(A_t | S_t, \theta)\]
where \(G_t\) is the return from step \(t\).
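A REINFORCE sketch with a softmax policy on a hypothetical one-step task (so \(\gamma^t = 1\) and \(G_t\) is just the reward). The reward structure is invented: action 1 pays 1, action 0 pays 0:

```python
import numpy as np

rng = np.random.default_rng(0)

# Invented one-step task: action 1 yields reward 1, action 0 yields reward 0.
theta = np.zeros(2)   # one logit per action
alpha = 0.1

def pi(theta):
    e = np.exp(theta - theta.max())   # numerically stable softmax
    return e / e.sum()

for _ in range(2000):                 # one sampled episode per iteration
    p = pi(theta)
    a = int(rng.choice(2, p=p))
    G = 1.0 if a == 1 else 0.0        # Monte-Carlo return of the episode
    grad_log = np.eye(2)[a] - p       # grad of log softmax at the taken action
    theta += alpha * G * grad_log     # REINFORCE ascent step
print(pi(theta))                      # mass shifts towards the rewarded action
```

The gradient ascent only reinforces actions that were actually sampled and rewarded, so the policy's probability mass drifts towards action 1.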
Variance Reduction
The raw Policy Gradient has high variance. We use two main techniques to reduce it:
A. Causality (Temporal Structure)
Rewards from the past cannot be affected by actions in the future. Instead of using the full trajectory return \(R(\tau)\), we use the return-to-go \(G_t\):
\[\nabla_\theta J(\theta) \approx \frac{1}{m} \sum_{i=1}^m \sum_{t=0}^{T-1} G_t^i \, \nabla_\theta \log \pi_\theta(a_t^i | s_t^i)\]
B. Baselines
We can subtract a baseline \(b(s)\) from the return. This does not change the expected gradient (it remains unbiased) but significantly reduces variance:
\[\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\left[ (G_t - b(s_t)) \nabla_\theta \log \pi_\theta(a_t | s_t) \right]\]
- A good baseline is the state value function: \(b(s_t) \approx V(s_t)\).
Actor-Critic
To further reduce variance (at the cost of some bias), we replace the noisy MC return \(G_t\) with a value function estimate (Critic).
- Actor: The policy \(\pi_\theta(a|s)\) (updates \(\theta\)).
- Critic: The value function \(Q_w(s,a)\) or \(V_w(s)\) (updates \(w\)).
Advantage Actor-Critic: Using the Advantage Function \(A(s,a) = q^\pi(s,a) - v^\pi(s)\) combines the Critic with a Baseline.
The gradient becomes:
\[\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\left[ A(s,a) \nabla_\theta \log \pi_\theta(a|s) \right]\]
N-step Estimators: We can blend TD and MC methods using N-step returns to trade off bias and variance:
- 1-step (TD): Low variance, High bias.
- \(\infty\)-step (MC): High variance, Low bias.