This post derives the popular optimal control framework called the Linear Quadratic Regulator (LQR) along with several of its variants. There are many ways to derive the LQR algorithm; here we follow a reinforcement learning perspective.
1. Notation
- State at time $t$: $x_t$
- Action (input) at time $t$: $u_t$
- Policy: $\pi(u_t \vert x_t)$
- Cost-to-go at time $t$ from state $x_t$: $V_{t}(x_t)$
2. Objective
Optimal control finds a sequence of actions, or a policy (stochastic or deterministic), starting from a particular state that minimizes an objective.
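Concretely, for the linear dynamics $x_{t+1} = Ax_{t}+Bu_{t}$ used throughout this post, the finite-horizon objective (matching the cost-to-go defined in the next section, under the standard assumptions $Q, Q_f \succeq 0$ and $R \succ 0$) is
$J = x_{T}^TQ_fx_T+\sum_{t=0}^{T-1}x_{t}^TQx_{t}+u_{t}^TRu_{t}$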
3. Value iteration
Similar to value iteration in reinforcement learning, we can apply a Bellman backup to update the current value function (cost-to-go).
Let, $V_{t}(x_t) = \min_{u_{t:T-1}} \Big[x_{T}^TQ_fx_T+\sum_{\tau=t}^{T-1}x_{\tau}^TQx_{\tau}+u_{\tau}^TRu_{\tau}\Big]$
Then, $V_{t+1}(x_{t+1}) = \min_{u_{t+1:T-1}} \Big[x_{T}^TQ_fx_T+\sum_{\tau=t+1}^{T-1}x_{\tau}^TQx_{\tau}+u_{\tau}^TRu_{\tau}\Big]$
Combining these two, we can formulate the recursive (Bellman) equation
$V_{t}(x_t) = \min_{u_{t}} \Big[x_{t}^TQx_{t}+u_{t}^TRu_{t}+V_{t+1}(Ax_t+Bu_t)\Big]$
and the deterministic policy becomes
$\pi_t(x_t) = \arg\min_{u_{t}} \Big[x_{t}^TQx_{t}+u_{t}^TRu_{t}+V_{t+1}(Ax_t+Bu_t)\Big]$
4. Explicit Policy
We can also posit that the value function (cost-to-go) is quadratic by exploiting the quadratic structure of the cost function ($J$).
Let $V_{t}(x_t) = x_t^TP_tx_t+q_t$, where $P_t$ is a matrix and $q_t$ is a scalar.
Substituting this cost-to-go into the deterministic policy from Section 3 and taking the gradient with respect to $u_t$:
$F(x_t,u_t) = (Ax_t+Bu_t)^TP_{t+1}(Ax_t+Bu_t)+q_{t+1} + x_{t}^TQx_{t} + u_{t}^TRu_{t}$
$\nabla_{u_{t}} F(x_t,u_t) = 2Ru_t + 2B^TP_{t+1}(Ax_t+Bu_t) = 0$
Solving for $u_t$ gives $u_t = -(B^TP_{t+1}B+R)^{-1}B^TP_{t+1}Ax_t$. We can now re-write the recursive equation from the value iteration section by substituting the quadratic cost-to-go, which yields the backward recursion below.
Initial Condition
$P_T = Q_f$
$q_T = 0$
For $t = T-1, \dots, 0$:
$P_{t}=(A+BK_t)^TP_{t+1}(A+BK_t)+K_t^TRK_t+Q$
where $K_t = -(B^TP_{t+1}B+R)^{-1}B^TP_{t+1}A$
$q_{t}=q_{t+1}$
Optimal action(input) at time t: $u_t^{*}=K_tx_t$
Cost to go at time t: $V_{t}(x_t) = x_t^TP_tx_t+q_t$
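As a minimal sketch of this recursion in NumPy (the matrices below, a double integrator, are hypothetical placeholders, not from the original post):

```python
import numpy as np

def lqr_backward(A, B, Q, R, Q_f, T):
    """Finite-horizon LQR backward pass: returns gains K_0..K_{T-1} and P_0..P_T."""
    P = Q_f.copy()                 # P_T = Q_f
    Ks, Ps = [], [P]
    for _ in range(T):             # t = T-1, ..., 0
        K = -np.linalg.solve(B.T @ P @ B + R, B.T @ P @ A)   # K_t = -(B^T P B + R)^{-1} B^T P A
        Acl = A + B @ K
        P = Acl.T @ P @ Acl + K.T @ R @ K + Q                # P_t
        Ks.append(K)
        Ps.append(P)
    Ks.reverse(); Ps.reverse()
    return Ks, Ps

# Hypothetical double-integrator example
dt = 0.1
A = np.array([[1.0, dt], [0.0, 1.0]])
B = np.array([[0.0], [dt]])
Q, R, Q_f = np.eye(2), 0.1 * np.eye(1), 10.0 * np.eye(2)
Ks, Ps = lqr_backward(A, B, Q, R, Q_f, T=50)

x = np.array([1.0, 0.0])
for K in Ks:                       # forward rollout with u_t = K_t x_t
    u = K @ x
    x = A @ x + B @ u
```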
5. Variants
Linear time-invariant system with infinite-horizon control
Iterate the $K_t$ (and $P_t$) recursion until it converges, or solve the discrete algebraic Riccati equation directly, to obtain a steady-state gain $K_{ss}$.
Then use $u_{t} = K_{ss}x_t$ to control at every step.
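A minimal sketch of the iterate-until-convergence approach (hypothetical inputs, same recursion as above):

```python
import numpy as np

def lqr_infinite_horizon(A, B, Q, R, tol=1e-9, max_iter=100_000):
    """Iterate the Riccati recursion until P stops changing; return the steady-state gain K_ss."""
    P = Q.copy()
    for _ in range(max_iter):
        K = -np.linalg.solve(B.T @ P @ B + R, B.T @ P @ A)
        P_next = (A + B @ K).T @ P @ (A + B @ K) + K.T @ R @ K + Q
        if np.max(np.abs(P_next - P)) < tol:
            return K, P_next
        P = P_next
    raise RuntimeError("Riccati iteration did not converge")
```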
Linear dynamics with a constant term (affine system)
$x_{t+1} = Ax_{t}+Bu_{t} + c$
Let’s build an augmented state:
$\begin{bmatrix}x_{t+1}\\\\1\end{bmatrix} = \begin{bmatrix}A&c\\\\O&1\end{bmatrix}\begin{bmatrix}x_{t}\\\\1\end{bmatrix}+\begin{bmatrix}B\\\\O\end{bmatrix}u_t$
Then, following the same derivation as above, you get
$u_t = K_tz_t$, where $z_t = \begin{bmatrix}x_{t}\\\\1\end{bmatrix}$
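A sketch of constructing the augmented matrices (the constant term $c$ and dimensions here are hypothetical); the backward recursion from Section 4 is then applied unchanged to $(A_{aug}, B_{aug})$:

```python
import numpy as np

# Hypothetical double-integrator dynamics with a constant drift term c
A = np.array([[1.0, 0.1], [0.0, 1.0]])
B = np.array([[0.0], [0.1]])
c = np.array([[0.0], [0.05]])
n, m = A.shape[0], B.shape[1]

# Augmented dynamics: z_{t+1} = A_aug z_t + B_aug u_t with z_t = [x_t; 1]
A_aug = np.block([[A, c],
                  [np.zeros((1, n)), np.ones((1, 1))]])
B_aug = np.vstack([B, np.zeros((1, m))])

# Keep the original Q on x and put zero cost on the constant entry
Q = np.eye(n)
Q_aug = np.block([[Q, np.zeros((n, 1))],
                  [np.zeros((1, n)), np.zeros((1, 1))]])
```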
Linear dynamics with noise (stochastic dynamics)
$x_{t+1} = Ax_{t}+Bu_{t} + w_t$, where $E[w_t]=0$ and $E[w_tw_t^T]=\Sigma_w$
$P_{t}=(A+BK_t)^TP_{t+1}(A+BK_t)+K_t^TRK_t+Q$, which is the same as in the deterministic case.
$q_{t} = E[w_t^TP_{t+1}w_t]+q_{t+1} = Tr(\Sigma_wP_{t+1}) + q_{t+1}$
The control is also the same as in the deterministic case: $u_t = K_tx_t$. Only the expected cost-to-go grows, through the accumulated trace terms.
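A minimal sketch of the modified backward pass (the gains are unchanged; only the scalar offset $q$ accumulates trace terms; `Sigma_w` is a hypothetical noise covariance):

```python
import numpy as np

def lqr_backward_noise(A, B, Q, R, Q_f, Sigma_w, T):
    """Same recursion as the deterministic case, plus q_t = Tr(Sigma_w P_{t+1}) + q_{t+1}."""
    P, q = Q_f.copy(), 0.0
    Ks = []
    for _ in range(T):
        K = -np.linalg.solve(B.T @ P @ B + R, B.T @ P @ A)
        q = np.trace(Sigma_w @ P) + q          # noise only shifts the expected cost-to-go
        P = (A + B @ K).T @ P @ (A + B @ K) + K.T @ R @ K + Q
        Ks.append(K)
    Ks.reverse()
    return Ks, P, q
```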
Linear time-varying system
Replace $A$ and $B$ with the matrices of the corresponding time step (e.g. $A_t$ and $B_t$) in the backward recursion.
Penalization of changes in the control input
Starting from the linear system $x_{t+1} = Ax_{t}+Bu_{t}$, rewrite the dynamics and the input in terms of the previous input and its change:
$x_{t+1} = Ax_{t}+Bu_{t-1}+B(u_{t}-u_{t-1})$
$u_{t} =u_{t-1} + (u_{t}-u_{t-1})$
Stacking these two equations gives an augmented system with state $z_t = \begin{bmatrix}x_{t}\\\\u_{t-1}\end{bmatrix}$ and input $v_t = u_{t}-u_{t-1}$:
$\begin{bmatrix}x_{t+1}\\\\u_{t}\end{bmatrix} = \begin{bmatrix}A&B\\\\O&I\end{bmatrix}\begin{bmatrix}x_{t}\\\\u_{t-1}\end{bmatrix}+\begin{bmatrix}B\\\\I\end{bmatrix}v_t$
Putting the $R$ penalty on $v_t$ then penalizes changes in the control input.
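A sketch of building this augmentation (hypothetical matrices and weights); the recursion from Section 4 then runs on the augmented system:

```python
import numpy as np

# Hypothetical system matrices
A = np.array([[1.0, 0.1], [0.0, 1.0]])
B = np.array([[0.0], [0.1]])
n, m = A.shape[0], B.shape[1]

# Augmented state z_t = [x_t; u_{t-1}], new input v_t = u_t - u_{t-1}
A_aug = np.block([[A, B],
                  [np.zeros((m, n)), np.eye(m)]])
B_aug = np.vstack([B, np.eye(m)])

# State cost on x only; R_delta penalizes the change in input v_t
Q_aug = np.block([[np.eye(n), np.zeros((n, m))],
                  [np.zeros((m, n)), np.zeros((m, m))]])
R_delta = 0.5 * np.eye(m)
```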
Trajectory following for non-linear systems (also applicable to non-linear system stabilization)
Let the non-linear system be $x_{t+1} = f(x_t, u_t)$.
Taking partial derivatives at ($x_t^{ref}$, $u_t^{ref}$) and a 1st-order approximation:
$x_{t+1} \approx f(x_t^{ref}, u_t^{ref}) + \frac{\partial f}{\partial x}\Big\vert_{x_t^{ref},u_t^{ref}}(x_t - x_t^{ref}) + \frac{\partial f}{\partial u}\Big\vert_{x_t^{ref},u_t^{ref}}(u_t - u_t^{ref})$
Subtracting the next reference state $x_{t+1}^{ref}$ from both sides:
$x_{t+1} - x_{t+1}^{ref} \approx A_t(x_t - x_t^{ref}) + B_t(u_t - u_t^{ref}) + c_t$
We obtain an approximate linearized affine (time-varying) system with
Transformed state: $z_t = \begin{bmatrix}x_{t} - x_{t}^{ref}\\\\1\end{bmatrix}$
Transformed input: $v_t = u_{t} - u_{t}^{ref}$
where $A_t = \frac{\partial f}{\partial x}\Big\vert_{x_t^{ref},u_t^{ref}}$, $B_t = \frac{\partial f}{\partial u}\Big\vert_{x_t^{ref},u_t^{ref}}$, and $c_t = f(x_t^{ref}, u_t^{ref}) - x_{t+1}^{ref}$.
Running the time-varying, affine LQR recursion on this system, we get the control policy (feedback law) $v_t = K_tz_t$, i.e. $u_t = u_t^{ref} + K_tz_t$.
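A minimal sketch of the whole procedure, assuming a hypothetical non-linear dynamics function `f(x, u)` and a given reference trajectory `x_ref`, `u_ref` (neither is from the original post), with the Jacobians approximated by finite differences:

```python
import numpy as np

def finite_diff_jacobians(f, x, u, eps=1e-5):
    """Numerically approximate A = df/dx and B = df/du at (x, u)."""
    n, m = x.size, u.size
    fx = f(x, u)
    A = np.zeros((n, n))
    B = np.zeros((n, m))
    for i in range(n):
        dx = np.zeros(n); dx[i] = eps
        A[:, i] = (f(x + dx, u) - fx) / eps
    for j in range(m):
        du = np.zeros(m); du[j] = eps
        B[:, j] = (f(x, u + du) - fx) / eps
    return A, B

def tracking_lqr(f, x_ref, u_ref, Q, R, Q_f):
    """Backward pass of time-varying affine LQR around a reference trajectory.

    x_ref: (T+1, n) reference states, u_ref: (T, m) reference inputs.
    Returns gains K_t acting on z_t = [x_t - x_ref[t]; 1].
    """
    T, n = u_ref.shape[0], x_ref.shape[1]
    m = u_ref.shape[1]
    # Terminal cost on the augmented state [x - x_ref; 1]
    P = np.block([[Q_f, np.zeros((n, 1))],
                  [np.zeros((1, n)), np.zeros((1, 1))]])
    Q_aug = np.block([[Q, np.zeros((n, 1))],
                      [np.zeros((1, n)), np.zeros((1, 1))]])
    Ks = []
    for t in reversed(range(T)):
        A, B = finite_diff_jacobians(f, x_ref[t], u_ref[t])
        c = f(x_ref[t], u_ref[t]) - x_ref[t + 1]        # affine residual c_t
        A_aug = np.block([[A, c.reshape(-1, 1)],
                          [np.zeros((1, n)), np.ones((1, 1))]])
        B_aug = np.vstack([B, np.zeros((1, m))])
        K = -np.linalg.solve(B_aug.T @ P @ B_aug + R, B_aug.T @ P @ A_aug)
        Acl = A_aug + B_aug @ K
        P = Acl.T @ P @ Acl + K.T @ R @ K + Q_aug
        Ks.append(K)
    Ks.reverse()
    return Ks

# At run time: u_t = u_ref[t] + K_t @ np.append(x_t - x_ref[t], 1.0)
```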
Reference
- CS287 Advanced Robotics Lecture 5