Advantage Weighted Regression - Literature Survey

This post summarizes a recent RL algorithm called Advantage Weighted Regression (AWR) (paper). Detailed derivations and explanations are added to help readers understand it deeply.
※ The explanations may not be fully accurate; readers should read this post carefully.

Contributions (personal view)

  • This algorithm improves on Reward Weighted Regression by maximizing a policy improvement objective instead of directly maximizing the return.
  • It is an off-policy algorithm, which gives it higher sample efficiency than on-policy methods.
  • It can learn a policy from a static expert dataset, without collecting or sampling data from the environment, much like behavior cloning.

Preliminaries

As always, we want to find a policy that maximizes the expected return (the sum of discounted rewards). The objective can be written either as an expectation over time steps along a trajectory or as an expectation over states and actions.
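Up to minor notational differences from the paper, the objective can be written as

$$J(\pi) = \mathbb{E}_{\tau \sim p_{\pi}(\tau)}\Big[\sum_{t=0}^{\infty} \gamma^{t} r_{t}\Big] = \mathbb{E}_{s \sim d^{\pi}(s)}\,\mathbb{E}_{a \sim \pi(a|s)}\big[r(s,a)\big],$$

where $p_{\pi}(\tau)$ is the trajectory distribution and $d^{\pi}(s)$ is the discounted state distribution induced by $\pi$.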

AWR Objective & Derivation

As mentioned in the contribution section, the AWR algorithm maximizes the expected policy improvement. In that form, the expectation is taken under the discounted state distribution of the new policy $\pi$, which we cannot sample from. Following Kakade and Langford (2002) and the TRPO paper (2015), the expectation under the sampling policy is tractable and approximates the true policy improvement up to a small, bounded error term.
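Concretely (up to notation), the expected improvement of $\pi$ over the sampling policy $\mu$ can be written in trajectory form as

$$\eta(\pi) = J(\pi) - J(\mu) = \mathbb{E}_{\tau \sim p_{\pi}(\tau)}\Big[\sum_{t} \gamma^{t} A^{\mu}(s_{t}, a_{t})\Big],$$

where $A^{\mu}(s,a) = R^{\mu}_{s,a} - V^{\mu}(s)$ is the advantage with respect to $\mu$. Evaluating this still requires trajectories from $\pi$, which is exactly what the approximation avoids.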

Using the same derivation as in the preliminaries section, and replacing the state distribution of $\pi$ with that of the sampling policy $\mu$, we obtain the objective as an expectation over states and actions.
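This gives the surrogate objective (again up to notation):

$$\hat{\eta}(\pi) = \mathbb{E}_{s \sim d^{\mu}(s)}\,\mathbb{E}_{a \sim \pi(a|s)}\big[A^{\mu}(s,a)\big],$$

where $d^{\mu}(s)$ is the discounted state distribution of the sampling policy $\mu$.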

“The objective can be difficult to optimize due to the dependency between $d^{\pi}(s)$ and $\pi$, as well as the need to collect samples from $\pi$” (from the paper)

We can view this optimization problem as a constrained policy search. According to the earlier TRPO paper (2015), $\hat{\eta}(\pi)$ is a valid approximation only when $\pi$ and $\mu$ are close enough, where closeness between the two distributions is measured by the KL divergence.
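In other words (up to notation), the constrained problem is

$$\max_{\pi}\;\; \mathbb{E}_{s \sim d^{\mu}(s)}\,\mathbb{E}_{a \sim \pi(a|s)}\big[A^{\mu}(s,a)\big] \quad \text{s.t.} \quad \mathbb{E}_{s \sim d^{\mu}(s)}\big[\mathrm{D}_{\mathrm{KL}}\big(\pi(\cdot|s)\,\|\,\mu(\cdot|s)\big)\big] \le \epsilon.$$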

We can rewrite this constrained problem in a soft-constrained form using a Lagrangian and solve it in closed form.
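Sketching the steps (notation may differ slightly from the paper): the Lagrangian is

$$\mathcal{L}(\pi, \beta) = \mathbb{E}_{s \sim d^{\mu}(s)}\,\mathbb{E}_{a \sim \pi(a|s)}\big[A^{\mu}(s,a)\big] + \beta\Big(\epsilon - \mathbb{E}_{s \sim d^{\mu}(s)}\big[\mathrm{D}_{\mathrm{KL}}\big(\pi(\cdot|s)\,\|\,\mu(\cdot|s)\big)\big]\Big),$$

and setting its derivative with respect to $\pi(a|s)$ to zero yields

$$\pi^{*}(a|s) = \frac{1}{Z(s)}\,\mu(a|s)\,\exp\!\Big(\frac{1}{\beta}A^{\mu}(s,a)\Big),$$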

where $Z(s)$ is the normalizing constant that makes $\pi^{*}(\cdot|s)$ sum (or integrate) to 1.

We want the parametric policy $\pi$ to be as close as possible to $\pi^{*}$ in terms of the KL divergence, which we can write as the following projection problem.
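That is (up to notation),

$$\arg\min_{\pi}\;\; \mathbb{E}_{s \sim d^{\mu}(s)}\big[\mathrm{D}_{\mathrm{KL}}\big(\pi^{*}(\cdot|s)\,\|\,\pi(\cdot|s)\big)\big].$$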

Expanding the definition of the KL divergence and dropping the terms that do not depend on $\pi$, you can easily get the problem below.
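Up to notation (and with the per-state normalizer $Z(s)$ dropped, as in the paper's final objective), this is the AWR policy update:

$$\arg\max_{\pi}\;\; \mathbb{E}_{s \sim d^{\mu}(s)}\,\mathbb{E}_{a \sim \mu(a|s)}\Big[\log \pi(a|s)\,\exp\!\Big(\frac{1}{\beta}A^{\mu}(s,a)\Big)\Big],$$

i.e., a weighted maximum-likelihood (regression) problem in which actions from the sampling policy are weighted by their exponentiated advantage.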

Off-policy Learning with Experience Replay

On-policy learning uses only the trajectory data ($\tau$) collected by the behavior (sampling) policy at the k-th iteration, which is inefficient because all of the data collected in previous iterations is thrown away. Instead, AWR uses the whole dataset stored in a replay buffer ($D$). However, the state distribution and the policy differ across iterations, so the expectations above cannot be taken directly under the current policy. In the AWR derivation, the samples in the replay buffer are treated as coming from a mixture of the policies from iterations 1 through k.
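Roughly (up to notation), the effective sampling policy and state distribution of the buffer are modeled as

$$\mu_{k}(a|s) = \sum_{i=1}^{k} w_{i}\,\pi_{i}(a|s), \qquad d^{\mu_{k}}(s) = \sum_{i=1}^{k} w_{i}\,d^{\pi_{i}}(s),$$

where the weights $w_{i}$ reflect each past iteration's share of the replay buffer, and the derivation above is then applied with $\mu_{k}$ in place of $\mu$.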

Algorithm

  1. Start from a random policy (e.g., given by the initial weights of the policy network).
  2. At each iteration, sample trajectories by following the current policy ($\pi_k$) and add them to the replay buffer.
  3. Uniformly sample $N$ transitions from the replay buffer ($D$).
  4. Update the state-value function by semi-gradient descent toward $TD(\lambda)$ targets.
  5. Update the policy using the objective derived above (a code sketch of the whole loop is given below).
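To make the loop concrete, here is a minimal Python sketch. This is not the authors' implementation: `env`, `policy`, `value_fn`, `replay_buffer`, and `sample_trajectories` are hypothetical placeholder objects, and `beta` and `max_weight` stand for the temperature and weight-clipping hyperparameters commonly used with this kind of exponentiated-advantage update.

```python
import numpy as np

# Minimal sketch of the AWR training loop described above (not the authors' code).
# `env`, `policy`, `value_fn`, `replay_buffer`, and `sample_trajectories` are
# hypothetical placeholders standing in for the usual RL machinery.

def awr_train(env, policy, value_fn, replay_buffer, sample_trajectories,
              num_iters=1000, batch_size=256, beta=0.05, max_weight=20.0):
    for k in range(num_iters):
        # Steps 1-2: collect trajectories with the current policy pi_k
        # and store them in the replay buffer D.
        replay_buffer.add(sample_trajectories(env, policy))

        # Step 3: uniformly sample N transitions from D, together with their
        # TD(lambda) returns computed from the buffered trajectories.
        states, actions, td_lambda_returns = replay_buffer.sample(batch_size)

        # Step 4: regress the state-value function V(s) onto the TD(lambda)
        # targets (semi-gradient descent).
        value_fn.fit(states, targets=td_lambda_returns)

        # Step 5: advantage-weighted regression on the policy, i.e. weighted
        # maximum likelihood of the buffered actions with weights
        # exp(A(s, a) / beta), clipped at max_weight for numerical stability.
        advantages = td_lambda_returns - value_fn.predict(states)
        weights = np.minimum(np.exp(advantages / beta), max_weight)
        policy.fit(states, actions, sample_weight=weights)
```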

Summary

The paper contains sections that I do not cover in this post. Specifically, the author (Jason Peng) is famous for DeepMimic, in which an agent learns agile skills from mocap data using RL. He reports that AWR converges faster than Proximal Policy Optimization and Reward Weighted Regression on motion imitation tasks. AWR can also learn from fully static data collected from an expert, and he compares AWR with other methods that can learn or clone an expert policy in a fully offline manner.