
RL Core

This post introduces the key concepts of reinforcement learning together with the underlying math. When reading recent RL papers, it is sometimes hard to follow the equations written under expectations. I hope this article paves the way to deeper insight into RL. If you are interested in this topic, I recommend Sutton & Barto's book (Reinforcement Learning: An Introduction) or David Silver's lectures.

What is Reinforcement Learning (RL)?

RL is one branch of machine learning. In contrast to other branches such as supervised and unsupervised learning, it is based on trial and error: the agent collects data by interacting with an environment, also called the world.

RL is built on the Markov Decision Process (MDP), in which the next state depends only on the current state and action (the Markov property). An MDP has 5 elements: states ($S$), actions ($A$), transition probabilities ($P$), rewards ($R$), and a discount factor ($\gamma$).
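
To make these five elements concrete, here is a minimal sketch of a tabular MDP in Python; the specific numbers are made up for illustration only.

```python
import numpy as np

# A tiny tabular MDP with N states and M actions (toy numbers for illustration).
N_STATES, N_ACTIONS = 3, 2
gamma = 0.9                                   # discount factor

# P[s, a, s'] = probability of landing in state s' after taking action a in state s.
P = np.zeros((N_STATES, N_ACTIONS, N_STATES))
P[0, 0] = [0.8, 0.2, 0.0]
P[0, 1] = [0.0, 0.9, 0.1]
P[1, 0] = [0.1, 0.8, 0.1]
P[1, 1] = [0.0, 0.1, 0.9]
P[2, :] = [0.0, 0.0, 1.0]                     # state 2 is absorbing

# R[s, a] = expected immediate reward for taking action a in state s.
R = np.array([[0.0, 0.0],
              [0.0, 1.0],
              [0.0, 0.0]])
```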

The goal of RL is to find the policy that maximizes the cumulative (sum of) discounted rewards.
※ How the rewards are summed depends on the problem you want to solve, e.g., an episodic task versus an infinite-horizon task.
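
In standard notation, the discounted return from time $t$ and the objective over policies are:

$$
G_t = \sum_{k=0}^{\infty} \gamma^{k} r_{t+k+1}, \qquad \pi^{*} = \arg\max_{\pi} \; \mathbb{E}_{\pi}\left[ G_t \right].
$$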

Value Function & Bellman Equation

There are two kinds of value function. One is the state value function ($V(s)$) and the other is the action value function ($Q(s,a)$).
The state value function denotes how much return (sum of rewards) we would get starting from state $s$.

The action value function denotes how much return we would get from state $s$ when choosing a particular action $a$.
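
In equations, with the expectation taken over trajectories generated by the policy $\pi$:

$$
V^{\pi}(s) = \mathbb{E}_{\pi}\left[ G_t \mid s_t = s \right], \qquad
Q^{\pi}(s,a) = \mathbb{E}_{\pi}\left[ G_t \mid s_t = s,\, a_t = a \right].
$$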

We can also write these value functions recursively, in the dynamic-programming style that relates the value of the current state to the value of the next state. This is the Bellman Expectation Equation.
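
For the state and action value functions, the Bellman Expectation Equation reads:

$$
V^{\pi}(s) = \sum_{a} \pi(a \mid s) \Big[ R(s,a) + \gamma \sum_{s'} P(s' \mid s,a)\, V^{\pi}(s') \Big],
$$

$$
Q^{\pi}(s,a) = R(s,a) + \gamma \sum_{s'} P(s' \mid s,a) \sum_{a'} \pi(a' \mid s')\, Q^{\pi}(s',a').
$$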

Bellman Optimality Equation

The key idea is to choose the policy greedily so that it maximizes the value function. In other words, the policy (the probability of choosing an action at a state) puts probability 1 on the action with the maximum value.
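
In equation form, the Bellman Optimality Equation is:

$$
V^{*}(s) = \max_{a} \Big[ R(s,a) + \gamma \sum_{s'} P(s' \mid s,a)\, V^{*}(s') \Big], \qquad
Q^{*}(s,a) = R(s,a) + \gamma \sum_{s'} P(s' \mid s,a) \max_{a'} Q^{*}(s',a').
$$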

Policy Evaluation

  • Use the Bellman Expectation Equation to update the current value function (state or action) with a one-step look-ahead (see the sketch after this list)
  • The value function indicates how good the current policy is
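
As a minimal sketch, iterative policy evaluation on the toy MDP above (reusing the hypothetical `P`, `R`, and `gamma` defined earlier; the tolerance is an arbitrary choice) might look like:

```python
import numpy as np

def policy_evaluation(P, R, gamma, policy, tol=1e-8):
    """Iteratively apply the Bellman Expectation Equation for a fixed policy.

    policy[s, a] = probability of taking action a in state s.
    """
    n_states = P.shape[0]
    V = np.zeros(n_states)
    while True:
        # One-step look-ahead: V(s) <- sum_a pi(a|s) [R(s,a) + gamma * sum_s' P(s'|s,a) V(s')]
        V_new = np.sum(policy * (R + gamma * P @ V), axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new
```

For example, evaluating a uniform random policy on the toy MDP: `V = policy_evaluation(P, R, gamma, np.full((3, 2), 0.5))`.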

Value Iteration

  • Use the Bellman Optimality Equation to update the current value function (state or action) with a one-step look-ahead
  • The goal is to find the optimal policy and the optimal value function
  • After a few iterations, the value function converges. We can then act greedily with respect to it, i.e., choose the action with the highest value (see the sketch after this list).
  • How do we find the optimal policy?
    • For the state value function, we apply the optimality equation once more to construct $Q^{*}(s,a)$
    • For the action value function, we simply use $Q^{*}(s,a)$ to decide which action to take.
※ Convergence is guaranteed in the discrete state and action case with discounted rewards, by the contraction mapping theorem.
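
A minimal sketch of value iteration on the same toy MDP (again assuming the `P`, `R`, and `gamma` arrays defined earlier):

```python
import numpy as np

def value_iteration(P, R, gamma, tol=1e-8):
    """Repeatedly apply the Bellman Optimality Equation, then extract the greedy policy."""
    n_states = P.shape[0]
    V = np.zeros(n_states)
    while True:
        # Optimality back-up: V(s) <- max_a [R(s,a) + gamma * sum_s' P(s'|s,a) V(s')]
        Q = R + gamma * P @ V                  # shape (n_states, n_actions)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=1)     # optimal values and the greedy (optimal) policy
        V = V_new
```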

Policy Iteration

  • In contrast to Value Iteration, Policy Iteration alternates between evaluating the current policy and choosing greedy actions according to its action value function at each iteration (see the sketch after this list).
  • Algorithm
    • for $k = 1, \dots, K$:
      1. Estimate the value function of the current policy using the Bellman Expectation Equation (Policy Evaluation)
      2. Improve the policy by choosing actions greedily (Policy Improvement)
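
A minimal sketch, reusing the hypothetical `policy_evaluation` helper from above:

```python
import numpy as np

def policy_iteration(P, R, gamma, n_iterations=100):
    """Alternate policy evaluation and greedy policy improvement."""
    n_states, n_actions = R.shape
    policy = np.full((n_states, n_actions), 1.0 / n_actions)   # start from a uniform random policy
    for _ in range(n_iterations):
        # 1. Policy Evaluation: Bellman Expectation Equation for the current policy
        V = policy_evaluation(P, R, gamma, policy)
        # 2. Policy Improvement: act greedily with respect to Q(s, a)
        Q = R + gamma * P @ V
        new_policy = np.eye(n_actions)[Q.argmax(axis=1)]        # deterministic greedy policy
        if np.array_equal(new_policy, policy):                  # policy is stable -> converged
            break
        policy = new_policy
    return policy, V
```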

Summary & Limitations

Almost all current SOTA methods in deep reinforcement learning are based on these three basic algorithms (policy evaluation, value iteration, and policy iteration). For example, we can view Deep Q-Network (DQN) as a form of policy iteration: after collecting samples from the black-box environment, DQN fits the Q function to reduce the Bellman error (the squared error of the Bellman Equation), and then chooses epsilon-greedy actions based on the updated Q function. However, there are limitations to using the above algorithms directly.
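
To make that connection concrete, here is a schematic sketch of a DQN-style Bellman error and epsilon-greedy action selection; `q_net` and `target_net` are placeholder callables that map a batch of states to an array of Q-values, not any particular library's API.

```python
import numpy as np

def bellman_error(q_net, target_net, batch, gamma):
    """Squared Bellman error on a batch of (s, a, r, s', done) transitions (schematic)."""
    s, a, r, s_next, done = batch
    q_sa = q_net(s)[np.arange(len(a)), a]                        # Q(s, a) of the actions taken
    target = r + gamma * (1.0 - done) * target_net(s_next).max(axis=1)
    return np.mean((q_sa - target) ** 2)

def epsilon_greedy(q_net, s, n_actions, epsilon=0.1):
    """With probability epsilon explore, otherwise take the greedy action."""
    if np.random.rand() < epsilon:
        return np.random.randint(n_actions)
    return int(np.argmax(q_net(s[None]).squeeze(0)))
```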

  • Unknown Transition Probability
    First, we do not know the explicit transition probability. One way to address this is model-based reinforcement learning, which defines or learns a model of the transition probability; we can then predict or plan trajectories (state and action sequences) using that model. Another popular approach is model-free reinforcement learning, which learns a policy, a value function, or both in an end-to-end manner from samples. This approach does not model the transition probability at all; it treats the environment as a black box.

  • Curse of Dimensionality
    In practice, most problems have high-dimensional state and action spaces, and it is hard to store and manipulate such huge tables. Specifically, with $N$ states and $M$ actions we need $N^2 M$ values for the transition probability, $NM$ for the policy, $NM$ for the action value function, and $N$ for the state value function. When the state and action spaces are continuous, discretizing them finely enough is practically impossible. Therefore, we cannot fully back up the whole MDP tree or look ahead over the entire state and action space. To deal with this, we estimate value functions from samples, e.g., with Monte Carlo estimation or Temporal Difference (TD) learning (a minimal TD(0) sketch follows).
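
As a minimal sketch of the sample-based idea, a tabular TD(0) update for a single observed transition (the learning rate `alpha` here is an arbitrary choice):

```python
def td0_update(V, s, r, s_next, gamma, alpha=0.1):
    """Tabular TD(0): move V(s) toward the one-sample target r + gamma * V(s')."""
    td_target = r + gamma * V[s_next]
    V[s] += alpha * (td_target - V[s])
    return V
```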