
This post derives the popular optimal control framework called the Linear Quadratic Regulator (LQR) and some of its variants. There are many ways to derive the LQR algorithm; here we follow a reinforcement learning perspective.

1. Notation

  • State at time $t$: $x_t$
  • Action(Input) at time $t$: $u_t$
  • Policy: $\pi(u_t \vert x_t)$
  • Cost to go at time $t$ and state $x_t$: $V_{t}(x_t)$

2. Objective

Optimal control seeks a sequence of actions, or a policy (stochastic or deterministic), starting from a particular state, that minimizes an objective.
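Concretely, for the linear dynamics $x_{t+1}=Ax_t+Bu_t$ used throughout this post, the finite-horizon objective (written out here for reference, assuming as usual that $Q, Q_f$ are positive semidefinite and $R$ is positive definite) is

$J = x_{T}^TQ_fx_T+\sum_{t=0}^{T-1}\big(x_{t}^TQx_{t}+u_{t}^TRu_{t}\big)$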

3. Value iteration

Similar to value iteration in reinforcement learning, we can apply a Bellman backup to update the current value function (cost to go).

Let, $V_{t}(x_t) = \min_{u_{t:T-1}} \Big[x_{T}^TQ_fx_T+\sum_{\tau=t}^{T-1}x_{\tau}^TQx_{\tau}+u_{\tau}^TRu_{\tau}\Big]$

Then, $V_{t+1}(x_{t+1}) = \min_{u_{t+1:T-1}} \Big[x_{T}^TQ_fx_T+\sum_{\tau=t+1}^{T-1}x_{\tau}^TQx_{\tau}+u_{\tau}^TRu_{\tau}\Big]$

Since the tail cost from $t$ splits into the stage cost at $t$ plus the tail cost from $t+1$, and $x_{t+1}=Ax_t+Bu_t$, we can formulate the recursive equation

$V_{t}(x_t) = \min_{u_{t}} \Big[x_{t}^TQx_{t}+u_{t}^TRu_{t}+V_{t+1}(Ax_t+Bu_t)\Big]$

The deterministic policy now becomes

$\pi_t(x_t) = \arg\min_{u_{t}} \Big[x_{t}^TQx_{t}+u_{t}^TRu_{t}+V_{t+1}(Ax_t+Bu_t)\Big]$

4. Explicit Policy

We can also assume that the value function (cost to go) is quadratic in the state, by exploiting the quadratic structure of the cost function ($J$).

Let $V_{t}(x_t) = x_t^TP_tx_t+q_t$, where $P_t$ is a matrix and $q_t$ is a scalar.

From the deterministic policy in Section 3, we can substitute the cost-to-go function and set the gradient with respect to $u_t$ to zero.
$F(x_t,u_t) = (Ax_t+Bu_t)^TP_{t+1}(Ax_t+Bu_t)+q_{t+1} + x_{t}^TQx_{t} + u_{t}^TRu_{t}$

$\nabla_{u_{t}} F(x_t,u_t) = 2Ru_t + 2B^TP_{t+1}(Ax_t+Bu_t) = 0$

Solving for $u_t$ gives $u_t = -(B^TP_{t+1}B+R)^{-1}B^TP_{t+1}Ax_t = K_tx_t$.

We can now rewrite the recursive equation from the value iteration section by substituting the quadratic form of the cost-to-go function.

Initial Condition
$P_T = Q_f$

$q_T = 0$

For $t = T-1, \dots, 0$:

  1. $P_{t}=(A+BK_t)^TP_{t+1}(A+BK_t)+K_t^TRK_t+Q$

    where $\,K_t = -(B^TP_{t+1}B+R)^{-1}B^TP_{t+1}A$

  2. $q_{t}=q_{t+1}$

Optimal action(input) at time t: $u_t^{*}=K_tx_t$

Cost to go at time t: $V_{t}(x_t) = x_t^TP_tx_t+q_t$
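A minimal NumPy sketch of this backward pass, assuming $A$, $B$, $Q$, $R$, $Q_f$ are given as arrays (function and variable names are illustrative):

```python
import numpy as np

def lqr_backward_pass(A, B, Q, R, Qf, T):
    """Finite-horizon LQR: returns gains K_t and cost-to-go matrices P_t."""
    P = Qf.copy()                      # P_T = Q_f
    Ks, Ps = [], [P]
    for _ in range(T):                 # t = T-1, ..., 0
        # K_t = -(B^T P_{t+1} B + R)^{-1} B^T P_{t+1} A
        K = -np.linalg.solve(B.T @ P @ B + R, B.T @ P @ A)
        # P_t = (A + B K_t)^T P_{t+1} (A + B K_t) + K_t^T R K_t + Q
        Acl = A + B @ K
        P = Acl.T @ P @ Acl + K.T @ R @ K + Q
        Ks.append(K)
        Ps.append(P)
    Ks.reverse(); Ps.reverse()         # index 0 corresponds to t = 0
    return Ks, Ps

# usage: u_t = Ks[t] @ x_t
```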

5. Variants

Linear Time-Invariant (LTI) system with infinite-horizon control

Iterate the backward recursion until $K_t$ converges, or solve the discrete algebraic Riccati equation directly.
Then use $u_{t} = K_{ss}x_t$ to control at each step, as sketched below.
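For the infinite-horizon LTI case, one option is SciPy's discrete algebraic Riccati solver; a sketch with the same sign convention as above ($u_t = K_{ss}x_t$):

```python
import numpy as np
from scipy.linalg import solve_discrete_are

def lqr_infinite_horizon(A, B, Q, R):
    # Steady-state solution of the discrete algebraic Riccati equation
    P = solve_discrete_are(A, B, Q, R)
    # K_ss = -(B^T P B + R)^{-1} B^T P A, so that u_t = K_ss x_t
    K_ss = -np.linalg.solve(B.T @ P @ B + R, B.T @ P @ A)
    return K_ss, P
```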

Linear dynamics with a constant term (affine system)

$x_{t+1} = Ax_{t}+Bu_{t} + c$

Let's build an augmented state.
$\begin{bmatrix}x_{t+1}\\\\1\end{bmatrix} = \begin{bmatrix}A&c\\\\O&1\end{bmatrix}\begin{bmatrix}x_{t}\\\\1\end{bmatrix}+\begin{bmatrix}B\\\\O\end{bmatrix}u_t$

Then, following the same derivation as above, you get
$u_t = K_tz_t$, where $z_t = \begin{bmatrix}x_{t}\\\\1\end{bmatrix}$

Linear dynamics with noise(stochastic dynamics)

$x_{t+1} = Ax_{t}+Bu_{t} + w_t$, where $E[w_t]=0$ and $E[w_tw_t^T]=\Sigma_w$

$P_{t}=(A+BK_t)^TP_{t+1}(A+BK_t)+K_t^TRK_t+Q$, which is the same as in the deterministic case.

$q_{t} = E[w_t^TP_{t+1}w_t]+q_{t+1} = Tr(\Sigma_wP_{t+1}) + q_{t+1}$

The control is also the same as in the deterministic case: $u_t = K_tx_t$

Linear Time-Variant system

Replace $A$ and $B$ with their time-dependent counterparts (e.g., $A_t$ and $B_t$) in the recursion above.

Penalization for changes in control inputs

From the linear system $x_{t+1} = Ax_{t}+Bu_{t}$, rewrite the dynamics in terms of the previous input and the change in input:

$x_{t+1} = Ax_{t}+Bu_{t-1}+B(u_{t}-u_{t-1})$
$u_{t} = u_{t-1} + (u_{t}-u_{t-1})$

Treating $\begin{bmatrix}x_{t}\\\\u_{t-1}\end{bmatrix}$ as an augmented state and $v_t = u_{t}-u_{t-1}$ as the new input,

$\begin{bmatrix}x_{t+1}\\\\u_{t}\end{bmatrix} = \begin{bmatrix}A&B\\\\O&I\end{bmatrix}\begin{bmatrix}x_{t}\\\\u_{t-1}\end{bmatrix}+\begin{bmatrix}B\\\\I\end{bmatrix}v_t$

so a quadratic penalty $v_t^TRv_t$ on the new input penalizes changes in the control input within the same LQR framework.

Trajectory following for non-linear systems (also applicable to non-linear system stabilization)

Let the non-linear system be $x_{t+1} = f(x_t, u_t)$, and let ($x_t^{ref}$, $u_t^{ref}$) be a reference trajectory.

Taking partial derivatives at ($x_t^{ref}$, $u_t^{ref}$) and using a first-order approximation,

$x_{t+1} \approx f(x_t^{ref}, u_t^{ref}) + \frac{\partial f}{\partial x}\Big\vert_{ref}(x_t - x_t^{ref}) + \frac{\partial f}{\partial u}\Big\vert_{ref}(u_t - u_t^{ref})$

Subtracting the next reference state from both sides,

$x_{t+1} - x_{t+1}^{ref} \approx \frac{\partial f}{\partial x}\Big\vert_{ref}(x_t - x_t^{ref}) + \frac{\partial f}{\partial u}\Big\vert_{ref}(u_t - u_t^{ref}) + f(x_t^{ref}, u_t^{ref}) - x_{t+1}^{ref}$

We can therefore build an approximate linearized affine system with

Transformed state: $z_t = \begin{bmatrix}x_{t} - x_{t}^{ref}\\\\1\end{bmatrix}$

Transformed input: $v_t = u_{t} - u_{t}^{ref}$

where $A_t = \frac{\partial f}{\partial x}\Big\vert_{ref}$, $B_t = \frac{\partial f}{\partial u}\Big\vert_{ref}$, and the constant term is $c_t = f(x_t^{ref}, u_t^{ref}) - x_{t+1}^{ref}$, so that $z_{t+1} = \begin{bmatrix}A_t&c_t\\\\O&1\end{bmatrix}z_t+\begin{bmatrix}B_t\\\\O\end{bmatrix}v_t$ as in the affine case above.

Now we get the control policy (feedback law) $v_t = K_tz_t$, i.e., $u_t = u_t^{ref} + K_tz_t$.
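A rough NumPy sketch of obtaining $A_t$, $B_t$, and $c_t$ numerically by finite differences around the reference pair (an alternative to analytic Jacobians; `f`, `x_ref`, `u_ref`, `x_next_ref` are illustrative names):

```python
import numpy as np

def linearize(f, x_ref, u_ref, x_next_ref, eps=1e-5):
    """First-order expansion of x_{t+1} = f(x_t, u_t) around a reference point."""
    n, m = x_ref.size, u_ref.size
    A = np.zeros((n, n))
    B = np.zeros((n, m))
    f0 = f(x_ref, u_ref)
    for i in range(n):                         # A_t = df/dx at the reference
        dx = np.zeros(n); dx[i] = eps
        A[:, i] = (f(x_ref + dx, u_ref) - f0) / eps
    for j in range(m):                         # B_t = df/du at the reference
        du = np.zeros(m); du[j] = eps
        B[:, j] = (f(x_ref, u_ref + du) - f0) / eps
    c = f0 - x_next_ref                        # offset term of the affine system
    return A, B, c
```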

Reference

This post is based on the Next theme in Hexo.

Create new post

hexo new <layout> title
#ex) hexo new post tutorial1

Deploy and generate

hexo d -g

Local server

hexo s

Trust Region Policy Optimization [paper] is an on-policy learning method that constrains the policy update. There are many methods for addressing high-dimensional decision problems in robotics and games; the main idea here is to use function approximation instead of a matrix or table. Even though policy gradient methods have been successful, they cannot improve the policy efficiently.
The following post illustrates the 2002 [paper] first and then covers the TRPO paper.

Policy Improvement(2002 $\rightarrow$ 2015)

The concepts of both papers are similar: they focus on policy improvement rather than maximizing the policy objective itself. This formulation guarantees steady improvement in the 2002 paper. They also prove bounds on, and the effectiveness of, the surrogate objective function with theoretical and experimental results.

Policy Objective(surrogate objective)

  • In time representation
  • In state representation

$\eta(\tilde\pi) = E_{\tau\sim\tilde\pi}[\sum_{t=0}^{\infty}\gamma^{t}r(s_{t})]$, $\,\,\eta(\pi) = E_{\tau\sim\pi}[\sum_{t=0}^{\infty}\gamma^{t}r(s_{t})]$

$\eta(\tilde\pi) = \eta(\pi)-E_{\tau\sim\pi}[\sum_{t=0}^{\infty}\gamma^{t}r(s_{t})] + E_{\tau\sim\tilde\pi}[\sum_{t=0}^{\infty}\gamma^{t}r(s_{t})]$

$\eta(\tilde\pi) = \eta(\pi)-V_{\pi}(s_0) + E_{\tau\sim\tilde\pi}[\sum_{t=0}^{\infty}\gamma^{t}r(s_{t})]$

$\eta(\tilde\pi) = \eta(\pi) + E_{\tau\sim\tilde\pi}[-V_{\pi}(s_0)+\sum_{t=0}^{\infty}\gamma^{t}r(s_{t})]$

$\eta(\tilde\pi) = \eta(\pi) + E_{\tau\sim\tilde\pi}[-V_{\pi}(s_0)+r(s_0)+\gamma V_{\pi}(s_1)+\gamma\{-V_{\pi}(s_1)+r(s_1)+\gamma V_{\pi}(s_2)\}…]$

$\eta(\tilde\pi) = \eta(\pi) + E_{\tau\sim\tilde\pi}[\sum_{t=0}^{\infty}\gamma^t A_{\pi}(s_t,a_t)]$

$\eta(\tilde\pi) = \eta(\pi) + \sum_{t=0}^{\infty}\gamma^t\sum_sP(s_t=s;\tilde\pi)\sum_a\tilde\pi(a_t|s_t) A_{\pi}(s_t,a_t)$

Result

It is hard to work with the objective under $\rho_{\tilde\pi}(s)$. In the general policy iteration method, policy evaluation is followed by policy improvement. Right after the improvement, the agent has not yet experienced or rolled out the new policy, so the new discounted (unnormalized) visitation frequency is not available. The main idea in the 2002 and 2015 papers is to use the previous policy's state distribution (unnormalized) instead of the next policy's.

They point out that directly maximizing the advantage function does not guarantee policy improvement. This is because the advantage in practice is parameterized, which introduces estimation error and approximation error at the same time. Instead, they use a conservative policy iteration update so that they can provide explicit lower bounds on the improvement of $\eta$.

Boundedness

Let's go with the 2002 approach before we dive into the 2015 approach, which slightly changes the policy improvement step.

  • Properties & Condition

$\eta(\tilde\pi) = \eta(\pi) + E_{\tau\sim\tilde\pi}[\sum_{t=0}^{\infty}\gamma^t A_{\pi}(s_t,a_t)]$

If $\tilde\pi=\pi$

$E_{\tau\sim\pi}[\sum_{t=0}^{\infty}\gamma^t A_{\pi}(s_t,a_t)]=0$

$\sum_{s}\rho_{\pi}(s)\sum_{a}\pi(a|s)A_{\pi}(s,a)=0$

$\sum_{a}\pi(a|s)A_{\pi}(s,a)=0$

$\epsilon_{old} = \max_s|E_{a\sim\pi’}A_{\pi}(s,a)|\geq |E_{a\sim\pi’}A_{\pi}(s,a)|$

  • Derivation

$\eta(\pi_{new}) = \eta(\pi_{old}) + E_{\tau\sim\pi_{new}}[\sum_{t=0}^{\infty}\gamma^t A_{\pi_{old}}(s_t,a_t)]$
$\eta(\pi_{new}) = \eta(\pi_{old}) + \sum_{s}\rho_{\pi_{new}}(s)\sum_{a}\pi_{new}(a|s)A_{\pi_{old}}(s,a)$
$\eta(\pi_{new}) = \eta(\pi_{old}) + \sum_{s}\rho_{\pi_{new}}(s)\sum_{a}\{(1-\alpha)\pi_{old}(a|s)+\alpha\pi’(a|s)\}A_{\pi_{old}}(s,a)$
$\eta(\pi_{new}) = \eta(\pi_{old}) + \sum_{s}\rho_{\pi_{new}}(s)\sum_{a}\{\alpha\pi’(a|s)\}A_{\pi_{old}}(s,a)$
$\eta(\pi_{new}) = \eta(\pi_{old}) + \sum_{t=0}^{\infty}\gamma^t\sum_sP(s_t=s;\pi_{new})\sum_{a}\{\alpha\pi’(a|s)\}A_{\pi_{old}}(s,a)$
$\eta(\pi_{new}) = \eta(\pi_{old}) + \sum_{t=0}^{\infty}\gamma^t\sum_s\{(1-\alpha)^t P(s_t=s;\pi_{old\,only})+(1-(1-\alpha)^t) P(s_t=s;\pi_{rest})\}\sum_{a}\{\alpha\pi’(a|s)\}A_{\pi_{old}}(s,a)$
Let $r_a$ denote $1-(1-\alpha)^t$,

$\eta(\pi_{new}) = \eta(\pi_{old}) + \sum_{t=0}^{\infty}\gamma^t\sum_s\{(1-r_a) P(s_t=s;\pi_{old\,only})+r_a P(s_t=s;\pi_{rest})\}\sum_{a}\{\alpha\pi’(a|s)\}A_{\pi_{old}}(s,a)$
$\eta(\pi_{new}) = L_{\pi_{old}}(\pi_{new}) + \sum_{t=0}^{\infty}\gamma^t\sum_s\{-r_a P(s_t=s;\pi_{old\,only})+r_a P(s_t=s;\pi_{rest})\}\sum_{a}\{\alpha\pi’(a|s)\}A_{\pi_{old}}(s,a)$
$\eta(\pi_{new}) \geq L_{\pi_{old}}(\pi_{new}) - \sum_{t=0}^{\infty}\gamma^t\,2r_{a}\,\alpha\,\epsilon_{old}$

This inequality means that if we maximize $L_{\pi_{old}}(\pi_{new})$, policy improvement is guaranteed up to an error term.

Let's go with the 2015 approach. They change the definition of $\epsilon$, gaining generality at the cost of a larger error term. We denote the new $\epsilon$ as $\epsilon_{new}$.

$\epsilon_{old} = \max_{s}\vert E_{a\sim\pi’} A_{\pi}(s,a)\vert = \max_s\vert\sum_a\pi(a|s)A_{\pi}(s,a)-\sum_a\pi’(a|s)A_{\pi}(s,a)\vert \leq 2\max_{s,a}\vert A_{\pi}(s,a) \vert = 2\epsilon_{new}$

From Theory to a Practical Implementation Form
  • Change the bound into a constraint
    In practice, the policy ($\pi$) is parameterized by $\theta$. If we used the theoretical penalty coefficient ($C$), the steps would be very small. The paper recommends choosing a fixed KL bound $\delta$ instead, which turns the penalized problem into a constrained optimization problem.
  • Exact method to heuristic approximation
    It is impractical to calculate $D_{KL}^{max}$ at each iteration. Instead, they use a heuristic approximation ($D_{KL}^{\rho}=E[D_{KL}]$).
  • Expectation becomes a sampled sum
    A sampled sum, i.e., Monte Carlo estimation, replaces the expectation.

$L_{\pi_{old}}(\pi_{new}) = \eta(\pi_{old})+\sum_{s}\rho_{\pi_{old}}(s)\sum_{a}\pi_{new}(a|s)A_{\pi_{old}}(s,a)$

$\sum_{s}\rho_{\pi}(s)\:\rightarrow\:\frac{1}{1-\gamma}E_{s\sim\rho_{old}}[*]$
$A_{\pi_{old}}(s,a)\:\rightarrow\:Q_{\pi_{old}}(s,a)$

Importance sampling (with sampling distribution $q$)

$\sum_{a}\pi_{new}(a|s)A_{\pi_{old}}(s,a)\:\rightarrow\:E_{a\sim q}\Big[\frac{\pi_{new}(a|s)}{q(a|s)}A_{\pi_{old}}(s,a)\Big]$
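A minimal PyTorch-style sketch of the resulting sample-based surrogate with $q=\pi_{old}$ (log-probabilities and advantage estimates are assumed to be given; names are illustrative):

```python
import torch

def surrogate_objective(logp_new, logp_old, advantages):
    # L ~ E_{a ~ q = pi_old}[ (pi_new(a|s) / pi_old(a|s)) * A_pi_old(s, a) ]
    ratio = torch.exp(logp_new - logp_old.detach())
    return (ratio * advantages).mean()   # maximize this subject to the KL constraint
```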

  • Linear Approximation of the Objective Function, Quadratic Approximation of the Constraint

Expanding the surrogate objective to first order and the KL constraint to second order around $\theta_{old}$ gives

$\max_{\Delta\theta}\:\nabla L^T\Delta\theta\quad \text{s.t.}\quad \frac{1}{2}\Delta\theta^TA\Delta\theta\leq\delta$

where $\:A\,$ is the Fisher Information Matrix (FIM), which is also the Hessian of $D_{KL}$

Details(Optional)

Lagrangian Method

$\mathcal{L}(\Delta\theta,\lambda) = \nabla L^T\Delta\theta - \lambda\Big(\frac{1}{2}\Delta\theta^TA\Delta\theta-\delta\Big)$

Setting the derivatives to zero yields two equations:

1. $\nabla L = \lambda A\Delta\theta\;\Rightarrow\;\Delta\theta = \frac{1}{\lambda}A^{-1}\nabla L$
2. $\lambda\Big(\frac{1}{2}\Delta\theta^T A \Delta\theta-\delta\Big)=0$

By the KKT conditions, $\lambda$ should be positive, so $\delta$ should equal $\frac{1}{2}\Delta\theta^T A \Delta\theta$.
Using the second condition together with equation 1,

$\delta=\frac{1}{2}\Delta\theta^T A \Delta\theta=\frac{1}{2\lambda^2}\nabla L^T A^{-1}\nabla L$

$\frac{1}{\lambda} = \sqrt{\frac{2\delta}{\nabla L^T A^{-1}\nabla L}}$

Finally, we get the update law and the maximal objective value.

$\Delta\theta = \theta-\theta_{old}= {\sqrt{\frac{2\delta}{\nabla L^T A^{-1}\nabla L}}}A^{-1}\nabla L $

$A^{-1}\nabla L $ denotes search direction.

${\sqrt{\frac{2\delta}{\nabla L^T A^{-1}\nabla L}}}$ indicates step size.

The maximal value of the linearized objective is $\nabla L^T \Delta\theta = {\sqrt{\frac{2\delta}{\nabla L^T A^{-1}\nabla L}}}{\nabla L^T A^{-1}\nabla L} = \sqrt{2\delta\,\nabla L^T A^{-1}\nabla L}$.

Conjugate Gradient Method

The conjugate gradient method is an iterative method to solve $Ax=b$ or, equivalently, to minimize $\Phi(x)=\frac{1}{2}x^TAx-b^Tx$, where $A$ is positive definite.
In the previous section, we needed $A^{-1}\nabla L$ to compute the update law. However, computing the inverse of the Hessian ($A^{-1}$) is computationally expensive. Instead, we compute Hessian-vector products iteratively.
In our case $Ax=g$, where $g$ is the gradient of the objective function ($\nabla L$).

Derivation(One-Minute Derivation of The Conjugate Gradient Algorithm)

We want to solve $Ax^*=g$ for $x^*$.

$x_{i+1}=x_{i}+\alpha_i d_i$

$x_{i+1}-x^*= x_{i}-x^*+\alpha_i d_i$

Let, $e_i=x_i-x^*$

$e_{i+1} = e_{i}+\alpha_i d_i$

$Ae_{i+1} = Ae_{i}+\alpha_i Ad_i$

Let, $r_i=g-Ax_i$

$r_{i+1} = r_{i}-\alpha_i Ad_i$

We choose the next residual to be orthogonal to the previous one, $r_{i+1}^Tr_i=r_{i}^Tr_{i+1}=0$.

$r_{i}^Tr_{i+1} = r_{i}^Tr_{i}-\alpha_i\, r_{i}^TAd_i=0\;\Rightarrow\;\alpha_i = \frac{r_i^Tr_i}{r_i^TAd_i}$

We still have to compute the directions $d_i$ iteratively.

In fact, another way to explain this part is via A-conjugacy (A-orthogonality): the vectors $d_i$ are n-dimensional and form an A-conjugate basis. See Gram-Schmidt orthogonalization if you need more background. Briefly, we can build n A-orthogonal basis vectors starting from any vector $d_0$ if A is invertible (nonzero eigenvalues).

Let

$d_{i+1} = r_{i+1} + \beta_{i+1}d_{i}$

$d_{i}^TAd_{i+1} = d_{i}^TAr_{i+1} + \beta_{i+1}d_{i}^TAd_{i}$

$0 = d_{i}^TAr_{i+1} + \beta_{i+1}d_{i}^TAd_{i}$

$r_{i+1}^T = r_{i}^T-\alpha_id_i^T A\,\,\,$

$r_{i+1}^Tr_{i+1} = r_{i}^Tr_{i+1}-\alpha_id_i^T Ar_{i+1}\,\,\,$

$d_i^T Ar_{i+1}=-\frac{1}{\alpha_i}r_{i+1}^Tr_{i+1}\,\,\,$

$d_{i} = r_{i} + \beta_{i}d_{i-1}$

$Ad_{i} = Ar_{i} + \beta_{i}Ad_{i-1}$

$d_{i}^TAd_{i} = d_{i}^TAr_{i} + \beta_{i}d_{i}^TAd_{i-1}$

$d_{i}^TAd_{i} = d_{i}^TAr_{i}$

Combining the relations above, $\beta_{i+1} = -\frac{d_i^TAr_{i+1}}{d_i^TAd_i} = \frac{r_{i+1}^Tr_{i+1}}{\alpha_i\,d_i^TAr_i} = \frac{r_{i+1}^Tr_{i+1}}{r_i^Tr_i}$, using $\alpha_i = \frac{r_i^Tr_i}{r_i^TAd_i}$.

Algorithm
  1. Initial guess $x_0$
  2. Set the initial condition $d_0=r_0=b-Ax_0$
  3. Repeat until $r_i$ is sufficiently small (the gradient of the objective is near zero, i.e., the linear system is nearly solved); a code sketch follows this list
    • compute $\alpha_i = \frac{r_i^Tr_i}{d_i^TAd_i}$
    • update $x_{i+1} = x_i + \alpha_i d_i$
    • update $r_{i+1} = r_i - \alpha_i Ad_i$
    • compute $\beta_{i+1} = \frac{r_{i+1}^Tr_{i+1}}{r_i^Tr_i}$
    • update $d_{i+1} = r_{i+1} + \beta_{i+1}d_i$
  4. Return $x_n$
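A minimal NumPy sketch of this algorithm, where $A$ is accessed only through a matrix-vector product function `Avp(v)` (which is all the next section needs; names are illustrative):

```python
import numpy as np

def conjugate_gradient(Avp, g, n_steps=10, tol=1e-10):
    """Approximately solve A x = g, given only the product function Avp(v) = A @ v."""
    x = np.zeros_like(g)          # initial guess
    r = g.copy()                  # residual r_0 = g - A x_0 = g
    d = r.copy()                  # first search direction
    rs_old = r @ r
    for _ in range(n_steps):
        Ad = Avp(d)
        alpha = rs_old / (d @ Ad)
        x = x + alpha * d
        r = r - alpha * Ad
        rs_new = r @ r
        if rs_new < tol:
            break
        d = r + (rs_new / rs_old) * d   # beta = rs_new / rs_old
        rs_old = rs_new
    return x
```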
How to compute $Ad_i$

We use the conjugate gradient method to avoid computing the inverse of the Hessian matrix. However, there is one problem left: the algorithm still needs the matrix $A$, the Hessian of the KL divergence. It is well known that computing a gradient is much cheaper than computing the Hessian itself. Therefore, we use the identity

$Ad_i = \nabla^2 KL\;d_i = \nabla\big(\nabla KL^{\,T} d_i\big)$

where $\nabla^2 KL$ is $A$

To be specific:

  1. Compute $KL(\pi,\pi_{old})$
  2. Use an autograd package to get $\nabla KL(\pi,\pi_{old})$
  3. Compute the scalar $\nabla KL(\pi,\pi_{old})^{T}d_{i}$
  4. Again, use the autograd package to get $\nabla\big(\nabla KL(\pi,\pi_{old})^{T}d_{i}\big)$

Now you get $Ad_i$ indirectly, as sketched below.
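A sketch of these four steps with PyTorch autograd, assuming `kl` is a scalar KL divergence computed from the policy parameters `params` and `d` is a flat vector (the damping term is a common practical addition, not part of the derivation above):

```python
import torch

def hessian_vector_product(kl, params, d, damping=0.1):
    # Step 2: first gradient, kept in the graph for a second differentiation
    grads = torch.autograd.grad(kl, params, create_graph=True)
    flat_grad = torch.cat([g.reshape(-1) for g in grads])
    # Step 3: scalar product (grad KL)^T d
    grad_dot_d = (flat_grad * d).sum()
    # Step 4: gradient of that scalar gives (Hessian of KL) @ d
    hvp = torch.autograd.grad(grad_dot_d, params)
    flat_hvp = torch.cat([h.reshape(-1) for h in hvp])
    return flat_hvp + damping * d
```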

Summary
  1. Compute the search direction ($A^{-1}\,\nabla L$) using the conjugate gradient method (n steps)
    • Input: an initial guess $x_0$ (e.g., the zero vector) and the gradient of the objective function
    • Inside the algorithm, use the Hessian-vector product to get $Ad_i$ indirectly
  2. Get the step size using the equation above
  3. Update the parameter vector ($\theta$); a sketch of the full step follows
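Putting the pieces together, a rough sketch of one update step using the two helpers sketched above (flattening/unflattening of parameters and the backtracking line search from the paper are omitted; `grad_L` and `Avp` are assumed to work on flat arrays):

```python
import numpy as np

def trpo_step(grad_L, Avp, delta=0.01):
    # 1. search direction s = A^{-1} grad L via conjugate gradient
    s = conjugate_gradient(Avp, grad_L)
    # 2. step size from the KL constraint: sqrt(2*delta / (s^T A s)),
    #    where s^T A s = grad_L^T A^{-1} grad_L
    sAs = s @ Avp(s)
    step_size = np.sqrt(2.0 * delta / sAs)
    # 3. parameter update Delta theta
    return step_size * s
```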

This post summarizes a recent RL algorithm called Advantage-Weighted Regression (AWR) (paper). Detailed derivations and explanations are added to help deepen understanding.
※ The explanation may not be accurate; readers should read this post carefully.

Contributions (personal view)

  • This algorithm improves Reward-Weighted Regression by using policy improvement instead of direct policy maximization.
  • It is an off-policy algorithm with higher sample efficiency than on-policy ones.
  • It can learn a policy from a static expert dataset without collecting or sampling data from the environment, similar to behavior cloning.

Preliminaries

As always, we want to find the policy that maximizes the return (the sum of discounted rewards). We can represent the objective over time or over states and actions.

AWR Objective & Derivation

As mentioned in the contributions section, the AWR algorithm maximizes the policy improvement. In this formulation, it is impossible to take the expectation under the discounted state distribution of the new policy ($\pi$). According to the 2002 (Sham Kakade and John Langford) and 2015 (TRPO) papers, the expectation under the sampling policy is tractable and approximates the true policy improvement up to a small error term (boundedness).

Following the same derivation as in the preliminaries section, we can write the objective as an expectation over states and actions.

“The objective can be difficult to optimize due to the dependency between $d^{\pi}(s)$ and $\pi$, as well as the need to collect samples from $\pi$” - In the paper

We can treat this optimization problem as a constrained policy search. This is because, according to the earlier paper (2015 TRPO), the improvement $\hat{\eta}(\pi)$ is guaranteed only when $\pi$ and $\mu$ are close enough (closeness between the distributions measured by the KL divergence).

We can rewrite this in soft-constrained form using a Lagrangian.

where $Z(s)$ is a normalizing constant that makes $\pi^{*}$ sum to 1

We want $\pi$ to be close enough to $\pi^*$ in terms of KL divergence, which we can write as an optimization (projection) problem.

Expanding the definition of the KL divergence, you can easily obtain the problem below.
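If I read the paper's result correctly, the projection reduces to an advantage-weighted regression (maximum-likelihood) problem of roughly the form

$\pi_{k+1} = \arg\max_{\pi}\; E_{s\sim d_{\mu}}E_{a\sim\mu(\cdot\vert s)}\Big[\log\pi(a\vert s)\,\exp\Big(\frac{1}{\beta}A^{\mu}(s,a)\Big)\Big]$

where $\mu$ is the sampling policy, $d_{\mu}$ its state distribution, $\beta$ the temperature arising from the Lagrangian, and $A^{\mu}$ the advantage under $\mu$.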

Off-policy Learning with Experience Replay

On-policy learning uses only the trajectory ($\tau$) data sampled by the behavior policy at the k-th iteration, which is inefficient because all data collected in previous iterations is thrown away. Instead, AWR uses the whole replay buffer ($D$). However, the state distribution and policy differ across iterations, which makes it impossible to compute expectations under the current policy exactly. In the AWR derivation, samples from the replay buffer are treated as coming from a prior policy that is a mixture of the policies from iterations 1 through k.

Algorithm

  1. Start from a random policy (e.g., generated by the initial weights of the policy network).
  2. At each iteration, sample trajectories following the current policy ($\pi_k$).
  3. Uniformly select $N$ samples from the replay buffer ($D$).
  4. Update the state value function ($TD(\lambda)$ targets with semi-gradient descent).
  5. Update the policy following the objective above (see the sketch after this list).
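A minimal PyTorch-style sketch of steps 4-5 for one iteration, assuming sampled states, actions, and $TD(\lambda)$ returns from the replay buffer; the exponentiated-advantage weights follow the objective above, and `beta`, `max_weight`, and `policy.log_prob` are illustrative names/hyperparameters:

```python
import torch

def awr_update(policy, value_fn, states, actions, returns,
               beta=0.05, max_weight=20.0):
    """One AWR update on a batch sampled uniformly from the replay buffer."""
    values = value_fn(states).squeeze(-1)

    # Step 4: regress the value function onto the TD(lambda) returns
    value_loss = ((values - returns) ** 2).mean()

    # Step 5: advantage-weighted regression on the sampled actions
    advantages = returns - values.detach()
    weights = torch.exp(advantages / beta).clamp(max=max_weight)
    policy_loss = -(weights * policy.log_prob(states, actions)).mean()

    return value_loss, policy_loss
```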

Summary

There are sections in the paper that I do not cover in this post. Specifically, the author (Jason Peng) is famous for "DeepMimic", in which an agent learns agile skills from mocap data using RL. He shows that AWR converges faster than Proximal Policy Optimization and Reward-Weighted Regression on motion imitation tasks. AWR can also learn from fully static data collected from an expert, and he compares AWR with other methods that can learn or clone an expert policy in a fully offline manner.

This post introduces the key concepts of reinforcement learning with the underlying math. When you read recent RL papers, it is sometimes hard to follow the equations written under expectations. I hope this article paves the way to deeper insight into RL. If you are interested in this topic, I recommend reading Sutton & Barto's book (Reinforcement Learning: An Introduction) or David Silver's lectures.

What is Reinforcement Learning (RL)?

RL is one branch of machine learning. In contrast to other branches such as supervised and unsupervised learning, it is based on trial and error, collecting data by interacting with an environment (also called the world).

RL is built on the Markov Decision Process (MDP), in which the next state depends only on the current state (the Markov property). An MDP in RL has 5 elements: states ($S$), actions ($A$), transition probabilities ($P$), a discount factor ($\gamma$), and rewards ($R$).

The goal of RL is to find the policy that maximizes the cumulative (sum of) discounted rewards.
※ The exact form of the sum of rewards depends on the problem you want to solve, e.g., an episodic task versus an infinite-horizon task.

Value Function & Bellman Equation

There are two kinds of value functions: the state value function ($V(s)$) and the action value function ($Q(s,a)$).
The state value function denotes how much return (sum of rewards) we would get starting from state $s$:

$V^{\pi}(s) = E_{\pi}\big[\sum_{t=0}^{\infty}\gamma^t r_t \,\vert\, s_0=s\big]$

The action value function denotes how much return we would get from state $s$ when choosing a particular action $a$:

$Q^{\pi}(s,a) = E_{\pi}\big[\sum_{t=0}^{\infty}\gamma^t r_t \,\vert\, s_0=s, a_0=a\big]$

We can also write these equations using dynamic programming, which relates the current step to the next one (the Bellman expectation equations):

$V^{\pi}(s) = \sum_a\pi(a\vert s)\sum_{s'}P(s'\vert s,a)\big[r(s,a)+\gamma V^{\pi}(s')\big]$

$Q^{\pi}(s,a) = \sum_{s'}P(s'\vert s,a)\big[r(s,a)+\gamma\sum_{a'}\pi(a'\vert s')Q^{\pi}(s',a')\big]$

Bellman Optimality Equation

The key idea is to choose the policy greedily so as to maximize the value function. This means the policy puts probability 1 on the action with the maximal value:

$V^{*}(s) = \max_a\sum_{s'}P(s'\vert s,a)\big[r(s,a)+\gamma V^{*}(s')\big]$

$Q^{*}(s,a) = \sum_{s'}P(s'\vert s,a)\big[r(s,a)+\gamma\max_{a'}Q^{*}(s',a')\big]$

Policy Evaluation

  • Use the Bellman expectation equation to update the current value function (state or action) by looking ahead one step
  • The value function indicates how good the current policy is

Value Iteration

  • Use the Bellman optimality equation to update the current value function (state or action) by looking ahead one step
  • The goal is to find the optimal policy and the optimal value function
  • After enough iterations the value function converges, and we can read a policy off it by choosing the action with the highest value (see the sketch after this list)
  • How to find the optimal policy?
    • For the state value function, apply the optimality equation once more to form $Q^{*}(s,a)$
    • For the action value function, just use $Q^{*}(s,a)$ to decide which action to take
    ※ Convergence is guaranteed for the discrete state and action case with discounted rewards, by the contraction mapping theorem.
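A minimal tabular sketch of value iteration, assuming known transition probabilities `P[s, a, s']` and rewards `R[s, a]` (names are illustrative):

```python
import numpy as np

def value_iteration(P, R, gamma=0.99, n_iters=1000, tol=1e-8):
    """P: (S, A, S) transition probabilities, R: (S, A) rewards."""
    S, A, _ = P.shape
    V = np.zeros(S)
    for _ in range(n_iters):
        # Bellman optimality backup: Q(s,a) = R(s,a) + gamma * sum_s' P(s'|s,a) V(s')
        Q = R + gamma * (P @ V)          # shape (S, A)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            V = V_new
            break
        V = V_new
    # Greedy policy read off the converged value function
    policy = (R + gamma * (P @ V)).argmax(axis=1)
    return V, policy
```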

Policy Iteration

  • In contrast to value iteration, policy iteration chooses the greedy action according to the action value function at each iteration.
  • Algorithm
    • for $1$ : $K$ iterations
      1. Estimate the value function using the Bellman expectation equation (policy evaluation)
      2. Choose actions greedily (policy improvement)

Summary & Limitation

Almost all current state-of-the-art deep reinforcement learning methods are built on the three basic algorithms above (policy evaluation, value iteration, and policy iteration). For example, we can view Deep Q-Networks (DQN) as a form of policy iteration: after collecting samples from the black-box environment, DQN fits the Q function to reduce the Bellman error (the squared error of the Bellman equation), and then chooses epsilon-greedy actions based on the updated Q function. However, there are limitations to using the above algorithms directly.

  • Unknown Transition Probability
    First, we do not know the explicit transition probability. One way to address this is model-based reinforcement learning, which builds a model of the transition probability; we can then predict or plan the trajectory (state and action sequence) with the model. Another popular approach is model-free reinforcement learning, which learns a policy, a value function, or both in an end-to-end manner from samples. This approach does not model the transition probability; it treats the environment as a black box.

  • Curse of Dimensionality
    In practice, most problems have high-dimensional action and state spaces, and it is hard to handle the resulting huge matrices. Specifically, with $N$ states and $M$ actions we need $N^2M$ values for the transition probability, $NM$ for the policy, $NM$ for the action value function, and $N$ for the state value function. When the state and action spaces are continuous, it is almost impossible to discretize them. Therefore, we cannot do a full backup of the whole MDP tree or look ahead over the entire state and action space. To deal with this, we use samples to estimate the value function, e.g., Monte Carlo estimation or Temporal Difference (TD) learning.

This post is to check the Markdown and MathJax settings.

Markdown Test

Title

big

big

big

big

big

Code Block

#python test
import numpy as np
import matplotlib.pyplot as plt
import torch

#include <iostream>
#include <fstream>
#include "Eigen/Core"
#include "ros.h"

Latex Test

\begin{pmatrix}
1 & a_1 & a_1^2 & \cdots & a_1^n \\\\
1 & a_2 & a_2^2 & \cdots & a_2^n \\\\
\vdots & \vdots& \vdots & \ddots & \vdots \\\\
1 & a_m & a_m^2 & \cdots & a_m^n
\end{pmatrix}