This post derives the popular optimal control framework called the Linear Quadratic Regulator (LQR) along with several of its variants. There are many ways to derive the LQR algorithm; here we follow a reinforcement learning perspective.
1. Notation
- State at time $t$: $x_t$
- Action (input) at time $t$: $u_t$
- Policy: $\pi(u_t \vert x_t)$
- Cost-to-go at time $t$ from state $x_t$: $V_{t}(x_t)$
2. Objective
Optimal control finds a sequence of actions, or a policy (stochastic or deterministic), starting from a particular state that minimizes an objective.
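Concretely, for the linear dynamics $x_{t+1} = Ax_{t}+Bu_{t}$ used throughout this post, the finite-horizon objective (matching the cost-to-go defined in the next section, under the standard assumptions $Q, Q_f \succeq 0$ and $R \succ 0$) is
$J = x_{T}^TQ_fx_T+\sum_{t=0}^{T-1}x_{t}^TQx_{t}+u_{t}^TRu_{t}$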
3. Value iteration
Similar to value iteration in reinforcement learning, we can apply a Bellman backup to update the current value function (cost-to-go).
Let, $V_{t}(x_t) = \min_{u_{t:T-1}} \Big[x_{T}^TQ_fx_T+\sum_{\tau=t}^{T-1}x_{\tau}^TQx_{\tau}+u_{\tau}^TRu_{\tau}\Big]$
Then, $V_{t+1}(x_{t+1}) = \min_{u_{t+1:T-1}} \Big[x_{T}^TQ_fx_T+\sum_{\tau=t+1}^{T-1}x_{\tau}^TQx_{\tau}+u_{\tau}^TRu_{\tau}\Big]$
Combining these two, we can formulate the recursive (Bellman) equation
$V_{t}(x_t) = \min_{u_{t}} \Big[x_{t}^TQx_{t}+u_{t}^TRu_{t}+V_{t+1}(Ax_t+Bu_t)\Big]$
and the deterministic policy becomes
$\pi_t(x_t) = \arg\min_{u_{t}} \Big[x_{t}^TQx_{t}+u_{t}^TRu_{t}+V_{t+1}(Ax_t+Bu_t)\Big]$
4. Explicit Policy
We can also posit that the value function (cost-to-go) is quadratic by exploiting the quadratic structure of the cost function ($J$).
Let $V_{t}(x_t) = x_t^TP_tx_t+q_t$, where $P_t$ is a matrix and $q_t$ is a scalar.
Substituting this cost-to-go into the deterministic policy from Section 3 and taking the gradient with respect to $u_t$:
$F(x_t,u_t) = (Ax_t+Bu_t)^TP_{t+1}(Ax_t+Bu_t)+q_{t+1} + x_{t}^TQx_{t} + u_{t}^TRu_{t}$
$\nabla_{u_{t}} F(x_t,u_t) = 2Ru_t + 2B^TP_{t+1}(Ax_t+Bu_t) = 0$
Solving for $u_t$ gives $u_t = -(B^TP_{t+1}B+R)^{-1}B^TP_{t+1}Ax_t$. We can now re-write the recursive equation from the value iteration section by substituting the quadratic cost-to-go, which yields the backward recursion below.
Initial Condition
$P_T = Q_f$
$q_T = 0$
For $t = T-1, \dots, 0$:
$P_{t}=(A+BK_t)^TP_{t+1}(A+BK_t)+K_t^TRK_t+Q$
where $K_t = -(B^TP_{t+1}B+R)^{-1}B^TP_{t+1}A$
$q_{t}=q_{t+1}$
Optimal action(input) at time t: $u_t^{*}=K_tx_t$
Cost to go at time t: $V_{t}(x_t) = x_t^TP_tx_t+q_t$
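As a minimal sketch of this recursion in NumPy (the matrices below, a double integrator, are hypothetical placeholders, not from the original post):

```python
import numpy as np

def lqr_backward(A, B, Q, R, Q_f, T):
    """Finite-horizon LQR backward pass: returns gains K_0..K_{T-1} and P_0..P_T."""
    P = Q_f.copy()                 # P_T = Q_f
    Ks, Ps = [], [P]
    for _ in range(T):             # t = T-1, ..., 0
        K = -np.linalg.solve(B.T @ P @ B + R, B.T @ P @ A)   # K_t = -(B^T P B + R)^{-1} B^T P A
        Acl = A + B @ K
        P = Acl.T @ P @ Acl + K.T @ R @ K + Q                # P_t
        Ks.append(K)
        Ps.append(P)
    Ks.reverse(); Ps.reverse()
    return Ks, Ps

# Hypothetical double-integrator example
dt = 0.1
A = np.array([[1.0, dt], [0.0, 1.0]])
B = np.array([[0.0], [dt]])
Q, R, Q_f = np.eye(2), 0.1 * np.eye(1), 10.0 * np.eye(2)
Ks, Ps = lqr_backward(A, B, Q, R, Q_f, T=50)

x = np.array([1.0, 0.0])
for K in Ks:                       # forward rollout with u_t = K_t x_t
    u = K @ x
    x = A @ x + B @ u
```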
5. Variants
Linear time-invariant system with infinite-horizon control
Iterate the $K_t$ (and $P_t$) recursion until it converges, or solve the discrete algebraic Riccati equation directly, to obtain a steady-state gain $K_{ss}$.
Then use $u_{t} = K_{ss}x_t$ to control at every step.
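A minimal sketch of the iterate-until-convergence approach (hypothetical inputs, same recursion as above):

```python
import numpy as np

def lqr_infinite_horizon(A, B, Q, R, tol=1e-9, max_iter=100_000):
    """Iterate the Riccati recursion until P stops changing; return the steady-state gain K_ss."""
    P = Q.copy()
    for _ in range(max_iter):
        K = -np.linalg.solve(B.T @ P @ B + R, B.T @ P @ A)
        P_next = (A + B @ K).T @ P @ (A + B @ K) + K.T @ R @ K + Q
        if np.max(np.abs(P_next - P)) < tol:
            return K, P_next
        P = P_next
    raise RuntimeError("Riccati iteration did not converge")
```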
Linear dynamics with a constant term (affine system)
$x_{t+1} = Ax_{t}+Bu_{t} + c$
Let’s build an augmented state:
$\begin{bmatrix}x_{t+1}\\\\1\end{bmatrix} = \begin{bmatrix}A&c\\\\O&1\end{bmatrix}\begin{bmatrix}x_{t}\\\\1\end{bmatrix}+\begin{bmatrix}B\\\\O\end{bmatrix}u_t$
Then, following the same derivation as above, you get
$u_t = K_tz_t$, where $z_t = \begin{bmatrix}x_{t}\\\\1\end{bmatrix}$
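A sketch of constructing the augmented matrices (the constant term $c$ and dimensions here are hypothetical); the backward recursion from Section 4 is then applied unchanged to $(A_{aug}, B_{aug})$:

```python
import numpy as np

# Hypothetical double-integrator dynamics with a constant drift term c
A = np.array([[1.0, 0.1], [0.0, 1.0]])
B = np.array([[0.0], [0.1]])
c = np.array([[0.0], [0.05]])
n, m = A.shape[0], B.shape[1]

# Augmented dynamics: z_{t+1} = A_aug z_t + B_aug u_t with z_t = [x_t; 1]
A_aug = np.block([[A, c],
                  [np.zeros((1, n)), np.ones((1, 1))]])
B_aug = np.vstack([B, np.zeros((1, m))])

# Keep the original Q on x and put zero cost on the constant entry
Q = np.eye(n)
Q_aug = np.block([[Q, np.zeros((n, 1))],
                  [np.zeros((1, n)), np.zeros((1, 1))]])
```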
Linear dynamics with noise (stochastic dynamics)
$x_{t+1} = Ax_{t}+Bu_{t} + w_t$, where $E[w_t]=0$ and $E[w_tw_t^T]=\Sigma_w$
$P_{t}=(A+BK_t)^TP_{t+1}(A+BK_t)+K_t^TRK_t+Q$, which is the same as in the deterministic case.
$q_{t} = E[w_t^TP_{t+1}w_t]+q_{t+1} = Tr(\Sigma_wP_{t+1}) + q_{t+1}$
The control is also the same as in the deterministic case: $u_t = K_tx_t$. Only the expected cost-to-go grows, through the accumulated trace terms.
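A minimal sketch of the modified backward pass (the gains are unchanged; only the scalar offset $q$ accumulates trace terms; `Sigma_w` is a hypothetical noise covariance):

```python
import numpy as np

def lqr_backward_noise(A, B, Q, R, Q_f, Sigma_w, T):
    """Same recursion as the deterministic case, plus q_t = Tr(Sigma_w P_{t+1}) + q_{t+1}."""
    P, q = Q_f.copy(), 0.0
    Ks = []
    for _ in range(T):
        K = -np.linalg.solve(B.T @ P @ B + R, B.T @ P @ A)
        q = np.trace(Sigma_w @ P) + q          # noise only shifts the expected cost-to-go
        P = (A + B @ K).T @ P @ (A + B @ K) + K.T @ R @ K + Q
        Ks.append(K)
    Ks.reverse()
    return Ks, P, q
```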
Linear time-varying system
Replace $A$ and $B$ with the matrices of the corresponding time step (e.g. $A_t$ and $B_t$) in the backward recursion.
Penalization of changes in the control input
Starting from the linear system $x_{t+1} = Ax_{t}+Bu_{t}$, rewrite the dynamics and the input in terms of the previous input and its change:
$x_{t+1} = Ax_{t}+Bu_{t-1}+B(u_{t}-u_{t-1})$
$u_{t} =u_{t-1} + (u_{t}-u_{t-1})$
Stacking these two equations gives an augmented system with state $z_t = \begin{bmatrix}x_{t}\\\\u_{t-1}\end{bmatrix}$ and input $v_t = u_{t}-u_{t-1}$:
$\begin{bmatrix}x_{t+1}\\\\u_{t}\end{bmatrix} = \begin{bmatrix}A&B\\\\O&I\end{bmatrix}\begin{bmatrix}x_{t}\\\\u_{t-1}\end{bmatrix}+\begin{bmatrix}B\\\\I\end{bmatrix}v_t$
Putting the $R$ penalty on $v_t$ then penalizes changes in the control input.
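A sketch of building this augmentation (hypothetical matrices and weights); the recursion from Section 4 then runs on the augmented system:

```python
import numpy as np

# Hypothetical system matrices
A = np.array([[1.0, 0.1], [0.0, 1.0]])
B = np.array([[0.0], [0.1]])
n, m = A.shape[0], B.shape[1]

# Augmented state z_t = [x_t; u_{t-1}], new input v_t = u_t - u_{t-1}
A_aug = np.block([[A, B],
                  [np.zeros((m, n)), np.eye(m)]])
B_aug = np.vstack([B, np.eye(m)])

# State cost on x only; R_delta penalizes the change in input v_t
Q_aug = np.block([[np.eye(n), np.zeros((n, m))],
                  [np.zeros((m, n)), np.zeros((m, m))]])
R_delta = 0.5 * np.eye(m)
```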
Trajectory following for non-linear systems (also applicable to non-linear system stabilization)
Let the non-linear system be $x_{t+1} = f(x_t, u_t)$.
Taking partial derivatives at ($x_t^{ref}$, $u_t^{ref}$) and a 1st-order approximation:
$x_{t+1} \approx f(x_t^{ref}, u_t^{ref}) + \frac{\partial f}{\partial x}\Big\vert_{x_t^{ref},u_t^{ref}}(x_t - x_t^{ref}) + \frac{\partial f}{\partial u}\Big\vert_{x_t^{ref},u_t^{ref}}(u_t - u_t^{ref})$
Subtracting the next reference state $x_{t+1}^{ref}$ from both sides:
$x_{t+1} - x_{t+1}^{ref} \approx A_t(x_t - x_t^{ref}) + B_t(u_t - u_t^{ref}) + c_t$
We obtain an approximate linearized affine (time-varying) system with
Transformed state: $z_t = \begin{bmatrix}x_{t} - x_{t}^{ref}\\\\1\end{bmatrix}$
Transformed input: $v_t = u_{t} - u_{t}^{ref}$
where $A_t = \frac{\partial f}{\partial x}\Big\vert_{x_t^{ref},u_t^{ref}}$, $B_t = \frac{\partial f}{\partial u}\Big\vert_{x_t^{ref},u_t^{ref}}$, and $c_t = f(x_t^{ref}, u_t^{ref}) - x_{t+1}^{ref}$.
Running the time-varying, affine LQR recursion on this system, we get the control policy (feedback law) $v_t = K_tz_t$, i.e. $u_t = u_t^{ref} + K_tz_t$.
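A minimal sketch of the whole procedure, assuming a hypothetical non-linear dynamics function `f(x, u)` and a given reference trajectory `x_ref`, `u_ref` (neither is from the original post), with the Jacobians approximated by finite differences:

```python
import numpy as np

def finite_diff_jacobians(f, x, u, eps=1e-5):
    """Numerically approximate A = df/dx and B = df/du at (x, u)."""
    n, m = x.size, u.size
    fx = f(x, u)
    A = np.zeros((n, n))
    B = np.zeros((n, m))
    for i in range(n):
        dx = np.zeros(n); dx[i] = eps
        A[:, i] = (f(x + dx, u) - fx) / eps
    for j in range(m):
        du = np.zeros(m); du[j] = eps
        B[:, j] = (f(x, u + du) - fx) / eps
    return A, B

def tracking_lqr(f, x_ref, u_ref, Q, R, Q_f):
    """Backward pass of time-varying affine LQR around a reference trajectory.

    x_ref: (T+1, n) reference states, u_ref: (T, m) reference inputs.
    Returns gains K_t acting on z_t = [x_t - x_ref[t]; 1].
    """
    T, n = u_ref.shape[0], x_ref.shape[1]
    m = u_ref.shape[1]
    # Terminal cost on the augmented state [x - x_ref; 1]
    P = np.block([[Q_f, np.zeros((n, 1))],
                  [np.zeros((1, n)), np.zeros((1, 1))]])
    Q_aug = np.block([[Q, np.zeros((n, 1))],
                      [np.zeros((1, n)), np.zeros((1, 1))]])
    Ks = []
    for t in reversed(range(T)):
        A, B = finite_diff_jacobians(f, x_ref[t], u_ref[t])
        c = f(x_ref[t], u_ref[t]) - x_ref[t + 1]        # affine residual c_t
        A_aug = np.block([[A, c.reshape(-1, 1)],
                          [np.zeros((1, n)), np.ones((1, 1))]])
        B_aug = np.vstack([B, np.zeros((1, m))])
        K = -np.linalg.solve(B_aug.T @ P @ B_aug + R, B_aug.T @ P @ A_aug)
        Acl = A_aug + B_aug @ K
        P = Acl.T @ P @ Acl + K.T @ R @ K + Q_aug
        Ks.append(K)
    Ks.reverse()
    return Ks

# At run time: u_t = u_ref[t] + K_t @ np.append(x_t - x_ref[t], 1.0)
```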
Reference
- CS287 Advanced Robotics Lecture 5