Trust Region Policy Optimization - Literature Survey

Trust Region Policy Optimization [paper] is an on-policy learning method that places a constraint on how much the policy may change at each update. Many methods address high-dimensional decision problems in robotics and game domains; the main idea is to use function approximation instead of a matrix or table. Even though policy gradient methods have been successful, they cannot improve the policy efficiently.
This post walks through the 2002 [paper] first and then covers the TRPO paper.

Policy Improvement (2002 $\rightarrow$ 2015)

The concepts of both papers are similar. Both focus on policy improvement rather than on maximizing the policy objective directly. In the 2002 paper this formulation guarantees steady improvement. Both papers also prove the boundedness and effectiveness of the surrogate objective function with theoretical and experimental results.

Policy Objective (Surrogate Objective)

  • In time representation
  • In state representation

$\eta(\tilde\pi) = E_{\tau\sim\tilde\pi}[\sum_{t=0}^{\infty}\gamma^{t}r(s_{t})]$, $\,\,\eta(\pi) = E_{\tau\sim\pi}[\sum_{t=0}^{\infty}\gamma^{t}r(s_{t})]$

$\eta(\tilde\pi) = \eta(\pi)-E_{\tau\sim\pi}[\sum_{t=0}^{\infty}\gamma^{t}r(s_{t})] + E_{\tau\sim\tilde\pi}[\sum_{t=0}^{\infty}\gamma^{t}r(s_{t})]$

$\eta(\tilde\pi) = \eta(\pi)-V_{\pi}(s_0) + E_{\tau\sim\tilde\pi}[\sum_{t=0}^{\infty}\gamma^{t}r(s_{t})]$

$\eta(\tilde\pi) = \eta(\pi) + E_{\tau\sim\tilde\pi}[-V_{\pi}(s_0)+\sum_{t=0}^{\infty}\gamma^{t}r(s_{t})]$

$\eta(\tilde\pi) = \eta(\pi) + E_{\tau\sim\tilde\pi}[-V_{\pi}(s_0)+r(s_0)+\gamma V_{\pi}(s_1)+\gamma\{-V_{\pi}(s_1)+r(s_1)+\gamma V_{\pi}(s_2)\}…]$

$\eta(\tilde\pi) = \eta(\pi) + E_{\tau\sim\tilde\pi}[\sum_{t=0}^{\infty}\gamma^t A_{\pi}(s_t,a_t)]$

$\eta(\tilde\pi) = \eta(\pi) + \sum_{t=0}^{\infty}\gamma^t\sum_sP(s_t=s;\tilde\pi)\sum_a\tilde\pi(a|s) A_{\pi}(s,a)$

Result

It is hard to work with this expression under $\rho_{\tilde\pi}(s)$. In the general policy iteration scheme, policy evaluation is followed by policy improvement. Right after an improvement step, the agent has not yet experienced or rolled out the new policy, so the new (unnormalized) discounted visitation frequency is not available. The main idea in both the 2002 and 2015 papers is to use the previous policy's state distribution instead of the new one.
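Concretely, both papers define the surrogate objective by substituting $\rho_{\pi}$ for $\rho_{\tilde\pi}$ (the same expression reappears in the implementation section below):

$L_{\pi}(\tilde\pi) = \eta(\pi) + \sum_{s}\rho_{\pi}(s)\sum_{a}\tilde\pi(a|s)A_{\pi}(s,a)$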

They point out that directly maximizing the advantage term does not by itself guarantee policy improvement, because in practice the advantage is parameterized, which introduces estimation error and approximation error at the same time. Instead, they use a conservative policy iteration update, so that they can provide an explicit lower bound on the improvement of $\eta$.
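The conservative policy iteration update of the 2002 paper mixes the current policy $\pi_{old}$ with a candidate policy $\pi'$ using a mixing coefficient $\alpha\in[0,1]$; this is the $\pi_{new}$ used in the derivation below:

$\pi_{new}(a|s) = (1-\alpha)\,\pi_{old}(a|s) + \alpha\,\pi'(a|s)$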

Boundedness

Let's walk through the 2002 approach before we dive into the 2015 approach, which slightly changes the policy improvement step.

  • Properties & Condition

$\eta(\tilde\pi) = \eta(\pi) + E_{\tau\sim\tilde\pi}[\sum_{t=0}^{\infty}\gamma^t A_{\pi}(s_t,a_t)]$

If $\tilde\pi=\pi$

$E_{\tau\sim\pi}[\sum_{t=0}^{\infty}\gamma^t A_{\pi}(s_t,a_t)]=0$

$\sum_{s}\rho_{\pi}(s)\sum_{a}\pi(a|s)A_{\pi}(s,a)=0$

$\sum_{a}\pi(a|s)A_{\pi}(s,a)=0$

$\epsilon_{old} = \max_s|E_{a\sim\pi’}A_{\pi}(s,a)|\geq |E_{a\sim\pi’}A_{\pi}(s,a)|$

  • Derivation

$\eta(\pi_{new}) = \eta(\pi_{old}) + E_{\tau\sim\pi_{new}}[\sum_{t=0}^{\infty}\gamma^t A_{\pi_{old}}(s_t,a_t)]$
$\eta(\pi_{new}) = \eta(\pi_{old}) + \sum_{s}\rho_{\pi_{new}}(s)\sum_{a}\pi_{new}(a|s)A_{\pi_{old}}(s,a)$
$\eta(\pi_{new}) = \eta(\pi_{old}) + \sum_{s}\rho_{\pi_{new}}(s)\sum_{a}\{(1-\alpha)\pi_{old}(a|s)+\alpha\pi’(a|s)\}A_{\pi_{old}}(s,a)$
$\eta(\pi_{new}) = \eta(\pi_{old}) + \sum_{s}\rho_{\pi_{new}}(s)\sum_{a}\{\alpha\pi’(a|s)\}A_{\pi_{old}}(s,a)$
$\eta(\pi_{new}) = \eta(\pi_{old}) + \sum_{t=0}^{\infty}\gamma^t\sum_sP(s_t=s;\pi_{new})\sum_{a}\{\alpha\pi’(a|s)\}A_{\pi_{old}}(s,a)$
At each step the mixture policy selects $\pi'$ with probability $\alpha$, so the probability that only $\pi_{old}$ has been executed up to time $t$ is $(1-\alpha)^t$:

$\eta(\pi_{new}) = \eta(\pi_{old}) + \sum_{t=0}^{\infty}\gamma^t\sum_s\{(1-\alpha)^t P(s_t=s;\pi_{old\,only})+(1-(1-\alpha)^t) P(s_t=s;\pi_{rest})\}\sum_{a}\{\alpha\pi'(a|s)\}A_{\pi_{old}}(s,a)$
Let $r_a$ denote $1-(1-\alpha)^t$ (note that $r_a$ depends on $t$),

$\eta(\pi_{new}) = \eta(\pi_{old}) + \sum_{t=0}^{\infty}\gamma^t\sum_s\{(1-r_a) P(s_t=s;\pi_{old\,only})+r_a P(s_t=s;\pi_{rest})\}\sum_{a}\{\alpha\pi’(a|s)\}A_{\pi_{old}}(s,a)$
$\eta(\pi_{new}) = L_{\pi_{old}}(\pi_{new}) + \sum_{t=0}^{\infty}\gamma^t\sum_s\{-r_a P(s_t=s;\pi_{old\,only})+r_a P(s_t=s;\pi_{rest})\}\sum_{a}\{\alpha\pi’(a|s)\}A_{\pi_{old}}(s,a)$
$\eta(\pi_{new}) \geq L_{\pi_{old}}(\pi_{new}) - \sum_{t=0}^{\infty}\gamma^t\,2\alpha\,r_{a}\,\epsilon_{old}$
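Carrying out the sum over $t$, $\sum_{t=0}^{\infty}\gamma^t r_a = \frac{1}{1-\gamma}-\frac{1}{1-\gamma(1-\alpha)} = \frac{\gamma\alpha}{(1-\gamma)(1-\gamma(1-\alpha))}$, gives the closed-form bound in which the TRPO paper restates the 2002 result:

$\eta(\pi_{new}) \geq L_{\pi_{old}}(\pi_{new}) - \frac{2\,\epsilon_{old}\,\gamma\,\alpha^2}{(1-\gamma)(1-\gamma(1-\alpha))}$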

This inequality means that if we maximize $L_{\pi_{old}}(\pi_{new})$, policy improvement is guaranteed up to the error term.

Now let's look at the 2015 approach. It changes the definition of $\epsilon$, gaining generality at the cost of a looser error bound. We denote the new $\epsilon$ as $\epsilon_{new}$.

$\epsilon_{old} = \max_{s}\vert E_{a\sim\pi'} A_{\pi}(s,a)\vert = \max_s\vert\sum_a\pi'(a|s)A_{\pi}(s,a)-\sum_a\pi(a|s)A_{\pi}(s,a)\vert \leq 2\max_{s,a}\vert A_{\pi}(s,a) \vert = 2\epsilon_{new}$
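With this definition, the 2015 paper replaces the mixture coefficient $\alpha$ by a divergence between the two policies and proves a bound of the form

$\eta(\tilde\pi) \geq L_{\pi}(\tilde\pi) - C\,D_{KL}^{max}(\pi,\tilde\pi), \quad C = \frac{4\,\epsilon_{new}\,\gamma}{(1-\gamma)^2}$

where $D_{KL}^{max}(\pi,\tilde\pi)=\max_s D_{KL}(\pi(\cdot|s)\,\|\,\tilde\pi(\cdot|s))$. The constant $C$ and the term $D_{KL}^{max}$ are what the practical-implementation steps below refer to.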

From Theory to a Practical Implementation
  • From the penalty coefficient $C$ to a fixed constraint $\delta$
    In practice, the policy $\pi$ is parameterized by $\theta$. If we used the theoretical penalty coefficient $C$, the resulting step sizes would be very small. The paper recommends choosing a fixed KL bound $\delta$ instead, which turns the penalized problem into a constrained one.
  • Exact maximum to heuristic approximation
    It is impractical to compute $D_{KL}^{max}$ at each iteration. Instead, they use the heuristic average-KL approximation ($\bar D_{KL}^{\rho}=E_{s\sim\rho}[D_{KL}]$).
  • Expectation becomes a sampled sum
    In this step, expectations are replaced by sample averages, i.e. Monte Carlo estimates.

$L_{\pi_{old}}(\pi_{new}) = \eta(\pi_{old})+\sum_{s}\rho_{\pi_{old}}(s)\sum_{a}\pi_{new}(a|s)A_{\pi_{old}}(s,a)$

$\sum_{s}\rho_{\pi_{old}}(s)\:\rightarrow\:\frac{1}{1-\gamma}E_{s\sim\rho_{\pi_{old}}}[\,\cdot\,]$
$A_{\pi_{old}}(s,a)\:\rightarrow\:Q_{\pi_{old}}(s,a)$

Importance sampling (with sampling distribution $q$)

$\sum_{a}\pi_{new}(a|s)A_{\pi_{old}}(s,a)\:\rightarrow\:E_{a\sim q}\left[\frac{\pi_{new}(a|s)}{q(a|s)}A_{\pi_{old}}(s,a)\right]$
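Putting these replacements together, the problem solved at each iteration in the TRPO paper is the sample-based constrained optimization

$\max_{\theta}\: E_{s\sim\rho_{\theta_{old}},\,a\sim q}\left[\frac{\pi_{\theta}(a|s)}{q(a|s)}Q_{\theta_{old}}(s,a)\right]\quad \text{s.t.}\quad E_{s\sim\rho_{\theta_{old}}}\left[D_{KL}\left(\pi_{\theta_{old}}(\cdot|s)\,\|\,\pi_{\theta}(\cdot|s)\right)\right]\leq\delta$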

  • Linear Approximation of Objective Function, Quadratic Approximation of Constraint

Approximating the objective to first order and the constraint to second order around $\theta_{old}$ gives

$\max_{\Delta\theta}\;\nabla L^{T}\Delta\theta\quad \text{s.t.}\quad \frac{1}{2}\Delta\theta^{T}A\,\Delta\theta\leq\delta$

where $A$ is the Fisher Information Matrix (FIM), which is also the Hessian of $D_{KL}$ evaluated at $\theta_{old}$.

Details (Optional)

Lagrangian Method

Form the Lagrangian $\mathcal{L}(\Delta\theta,\lambda)=\nabla L^{T}\Delta\theta-\lambda\left(\frac{1}{2}\Delta\theta^{T}A\,\Delta\theta-\delta\right)$. Setting its derivative with respect to $\Delta\theta$ to zero, together with the constraint, yields two equations:

$\nabla L = \lambda A\,\Delta\theta \;\;\Rightarrow\;\; \Delta\theta = \frac{1}{\lambda}A^{-1}\nabla L \qquad (1)$

$\frac{1}{2}\Delta\theta^{T}A\,\Delta\theta = \delta \qquad (2)$

By the KKT conditions, $\lambda$ must be positive and the constraint must be active, i.e. $\delta = \frac{1}{2}\Delta\theta^T A \Delta\theta$.
Substituting equation (1) into this second condition:

$\delta=\frac{1}{2}\Delta\theta^T A \Delta\theta=\frac{1}{2\lambda^2}\nabla L^T A^{-1}\nabla L$

$\frac{1}{\lambda} = \sqrt{\frac{2\delta}{\nabla L^T A^{-1}\nabla L}}$

Finally, we get the update rule and the maximum value of the objective.

$\Delta\theta = \theta-\theta_{old}= {\sqrt{\frac{2\delta}{\nabla L^T A^{-1}\nabla L}}}A^{-1}\nabla L $

$A^{-1}\nabla L$ is the search direction.

${\sqrt{\frac{2\delta}{\nabla L^T A^{-1}\nabla L}}}$ is the step size.

The maximum attained value is

$\nabla L^T \Delta\theta = {\sqrt{\frac{2\delta}{\nabla L^T A^{-1}\nabla L}}}\,{\nabla L^T A^{-1}\nabla L} = \sqrt{2\delta\,\nabla L^T A^{-1}\nabla L}$

Conjugate Gradient Method

This is an iterative method for solving $Ax=b$ or, equivalently, $\min_{x}\: \Phi(x)=\frac{1}{2}x^TAx-b^Tx$, where $A$ is positive definite.
In the previous section we needed $A^{-1}\nabla L$ to evaluate the update rule. However, computing the inverse of the Hessian ($A^{-1}$) is computationally expensive. Instead, we compute Hessian-vector products iteratively.
In our case the system is $Ax=g$, where $g$ is the gradient of the objective function ($\nabla L$).

Derivation (One-Minute Derivation of the Conjugate Gradient Algorithm)

We want to solve $Ax^*=g$.

$x_{i+1}=x_{i}+\alpha_i d_i$

$x_{i+1}-x^*= x_{i}-x^*+\alpha_i d_i$

Let, $e_i=x_i-x^*$

$e_{i+1} = e_{i}+\alpha_i d_i$

$Ae_{i+1} = Ae_{i}+\alpha_i Ad_i$

Let, $r_i=g-Ax_i$

$r_{i+1} = r_{i}-\alpha_i Ad_i$

We choose the next residual to be perpendicular to the previous one, $r_{i+1}^Tr_i=r_{i}^Tr_{i+1}=0$.

$r_{i}^Tr_{i+1} = r_{i}^Tr_{i}-\alpha_i\, r_{i}^TAd_i=0\;\;\Rightarrow\;\;\alpha_i=\frac{r_i^Tr_i}{r_i^TAd_i}$

We have to calculate $d_i$ iteratively.

In fact, another way to explain this step is A-conjugacy (A-orthogonality). The directions $d_i$ are n-dimensional vectors that form an A-conjugate basis. See Gram-Schmidt orthogonalization if you need more background. Briefly, we can build n such basis vectors starting from any vector $d_0$ as long as A is invertible (has nonzero eigenvalues).

Let

$d_{i+1} = r_{i+1} + \beta_{i+1}d_{i}$

$d_{i}^TAd_{i+1} = d_{i}^TAr_{i+1} + \beta_{i+1}d_{i}^TAd_{i}$

Imposing A-conjugacy of consecutive directions, $d_{i}^TAd_{i+1}=0$:

$0 = d_{i}^TAr_{i+1} + \beta_{i+1}d_{i}^TAd_{i}\;\;\Rightarrow\;\;\beta_{i+1}=-\frac{d_{i}^TAr_{i+1}}{d_{i}^TAd_{i}}$

$r_{i+1}^T = r_{i}^T-\alpha_id_i^T A\,\,\,$

$r_{i+1}^Tr_{i+1} = r_{i}^Tr_{i+1}-\alpha_id_i^T Ar_{i+1}\,\,\,$

$d_i^T Ar_{i+1}=-\frac{1}{\alpha_i}r_{i+1}^Tr_{i+1}\,\,\,$

$d_{i} = r_{i} + \beta_{i}d_{i-1}$

$Ad_{i} = Ar_{i} + \beta_{i}Ad_{i-1}$

$d_{i}^TAd_{i} = d_{i}^TAr_{i} + \beta_{i}d_{i}^TAd_{i-1}$

$d_{i}^TAd_{i} = d_{i}^TAr_{i}$

Putting these pieces together gives the standard coefficients:

$\alpha_i=\frac{r_i^Tr_i}{d_i^TAd_i},\qquad \beta_{i+1}=\frac{r_{i+1}^Tr_{i+1}}{r_i^Tr_i}$

Algorithm (a Python sketch follows the list)
  1. Start from an initial guess $x_0$
  2. Set the initial conditions $d_0=r_0=b-Ax_0$
  3. Repeat until $r_i$ is sufficiently small (the gradient of the objective is near zero, i.e. the linear system is nearly solved)
    • compute $\alpha_i=\frac{r_i^Tr_i}{d_i^TAd_i}$
    • update $x_{i+1}=x_i+\alpha_i d_i$
    • update $r_{i+1}=r_i-\alpha_i Ad_i$
    • compute $\beta_{i+1}=\frac{r_{i+1}^Tr_{i+1}}{r_i^Tr_i}$
    • update $d_{i+1}=r_{i+1}+\beta_{i+1}d_i$
  4. Return $x_n$
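A minimal sketch of this loop in Python (operating on PyTorch tensors), assuming `avp(v)` returns the product $Av$ as a flat 1-D tensor (in TRPO this is the Hessian-vector product described in the next section); `n_steps` and `tol` are illustrative choices, not values from the paper:

```python
import torch

def conjugate_gradient(avp, g, n_steps=10, tol=1e-10):
    """Approximately solve A x = g using only the matrix-vector product avp(v) = A v."""
    x = torch.zeros_like(g)        # initial guess x_0 = 0
    r = g.clone()                  # residual r_0 = g - A x_0 = g
    d = r.clone()                  # first search direction d_0 = r_0
    r_dot = r @ r
    for _ in range(n_steps):
        Ad = avp(d)                # A-vector (in TRPO: Hessian-vector) product
        alpha = r_dot / (d @ Ad)   # step length along d
        x = x + alpha * d          # update the solution estimate
        r = r - alpha * Ad         # update the residual
        new_r_dot = r @ r
        if new_r_dot < tol:        # residual small enough: system is (almost) solved
            break
        beta = new_r_dot / r_dot   # coefficient for the next conjugate direction
        d = r + beta * d           # new A-conjugate search direction
        r_dot = new_r_dot
    return x
```

For a symmetric positive definite matrix `A`, `conjugate_gradient(lambda v: A @ v, g)` converges to the same result as solving $Ax=g$ directly.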
How to compute $Ad_i$

We use the conjugate gradient method to avoid computing the inverse of the Hessian matrix. However, one problem remains: the algorithm still needs the matrix $A$, which is the Hessian of the KL divergence. It is well known that computing a gradient is much cheaper than forming the Hessian itself. Therefore, we use the identity

$\nabla_{\theta}\big(\nabla_{\theta} KL(\pi,\pi_{old})^{T} d_i\big) = \big(\nabla_{\theta}^{2} KL(\pi,\pi_{old})\big)\, d_i$

where $\nabla^2 KL$ is $A$.

To be specific

  1. Compute $KL(\pi,\pi_{old})$
  2. Use an autograd package to get $\nabla KL(\pi,\pi_{old})$
  3. Compute the scalar $\nabla KL(\pi,\pi_{old})^{T} d_{i}$
  4. Again, use the autograd package to get $\nabla\big(\nabla KL(\pi,\pi_{old})^{T} d_{i}\big)$

Now you have $Ad_i$ indirectly, without ever forming $A$.
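A minimal PyTorch sketch of these four steps. Here `get_kl` is an assumed closure that recomputes the mean KL divergence between the current policy and a detached copy of the old one, `params` is the list of policy parameters, and the damping term is a common implementation trick rather than part of the derivation:

```python
import torch

def hessian_vector_product(get_kl, params, d, damping=0.01):
    """Return (Hessian of KL) @ d without ever forming the Hessian A."""
    kl = get_kl()                                                 # step 1: (re)compute the mean KL
    grads = torch.autograd.grad(kl, params, create_graph=True)    # step 2: gradient of the KL
    flat_grad = torch.cat([g.reshape(-1) for g in grads])
    grad_dot_d = (flat_grad * d).sum()                            # step 3: scalar (grad KL)^T d
    hvp = torch.autograd.grad(grad_dot_d, params)                 # step 4: gradient of the scalar = A d
    flat_hvp = torch.cat([h.reshape(-1) for h in hvp])
    return flat_hvp + damping * d                                 # damping keeps CG well conditioned
```

This function plays the role of the `avp` callable expected by the conjugate gradient sketch above.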

Summary
  1. Compute the search direction ($A^{-1}\,\nabla L$) using the conjugate gradient method (n steps)
    • Input: an initial guess $x_0$ (e.g. the zero vector) and the gradient of the objective function
    • Inside the algorithm, use Hessian-vector products to obtain $Ad_i$ indirectly
  2. Compute the step size ${\sqrt{\frac{2\delta}{\nabla L^T A^{-1}\nabla L}}}$ from the equation above
  3. Update the parameter vector ($\theta$)
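A minimal sketch of how the three steps fit together, reusing `conjugate_gradient` and `hessian_vector_product` from the sketches above; `flat_grad_L` (the flattened $\nabla L$), `get_kl`, `params`, and the value of `delta` are assumed to come from the surrounding training code:

```python
import torch

def trpo_step(flat_grad_L, get_kl, params, delta=0.01):
    """One update: CG search direction, analytic step size, parameter update."""
    avp = lambda v: hessian_vector_product(get_kl, params, v)    # v -> A v
    # 1. Search direction x ~= A^{-1} grad(L) from the conjugate gradient sketch above.
    step_dir = conjugate_gradient(avp, flat_grad_L.detach(), n_steps=10)
    # 2. Step size sqrt(2*delta / (grad(L)^T A^{-1} grad(L))),
    #    using step_dir^T (A step_dir) as an estimate of grad(L)^T A^{-1} grad(L).
    quad = (step_dir * avp(step_dir)).sum()
    step_size = torch.sqrt(2 * delta / quad)
    # 3. Update theta <- theta_old + step size * search direction.
    full_step = step_size * step_dir
    offset = 0
    with torch.no_grad():
        for p in params:
            n = p.numel()
            p += full_step[offset:offset + n].view_as(p)
            offset += n
    return full_step
```

Note that the published algorithm additionally performs a backtracking line search along `full_step`, checking the surrogate improvement and the KL constraint; that safeguard is omitted here for brevity.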