(371g) Stabilization-Oriented Learning Algorithm for Optimal Control of Nonlinear Control-Affine System

Kim, Y., Seoul National University
Lee, J. M., Seoul National University
Three major approaches have been developed to solve optimal control problems [1]. (1) The first approach solves a finite-horizon optimal control problem for a specified initial condition, using either the calculus of variations or a direct optimization algorithm. The resulting solution is an open-loop control policy, so it is applied only for a short duration and the optimal control problem must be repeatedly re-solved based on updated system information. The well-known model predictive control (MPC) method belongs to this approach. (2) The second approach finds the solution of the Hamilton-Jacobi-Bellman (HJB) equation, which is constructed from Bellman's principle of optimality. The solution of the HJB equation is the optimal value function V*(x), which gives the minimum cost attainable from initial state x when the system is operated under the optimal control policy. Although this approach provides a closed-loop optimal control policy for all initial states, the HJB equation is a nonlinear partial differential equation (PDE) and is difficult to solve. Thus, several algorithms have been developed to find the optimal value function, building on basic schemes such as policy iteration (PI) and value iteration (VI) [2], [3], [4]. (3) The third approach is inverse optimal control, based on the fact that every control Lyapunov function (CLF) is the solution of the HJB equation for some meaningful cost function [5]. A variation of Sontag's formula then provides the optimal control policy for a user-defined cost function, provided the CLF has the same shape of level sets as the optimal value function [6].
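To fix notation for what follows (the control-affine dynamics and quadratic-in-u cost below are the standard setting assumed here for concreteness, not stated explicitly in the abstract), the HJB equation of the second approach and Sontag's formula of the third take the following forms:

```latex
% Control-affine dynamics and cost (assumed standard forms):
%   \dot{x} = f(x) + g(x)u, \qquad
%   J(x_0, u) = \int_0^{\infty} \big( q(x) + u^{\top} R u \big)\, dt
% HJB equation for the optimal value function V^*:
0 = \min_{u} \Big[\, q(x) + u^{\top} R u
      + \nabla V^*(x)^{\top} \big( f(x) + g(x) u \big) \Big]
% Sontag's formula for a CLF V, with a(x) = L_f V = \nabla V^{\top} f
% and b(x) = L_g V = \nabla V^{\top} g (a row vector):
u_S(x) =
\begin{cases}
  -\dfrac{a(x) + \sqrt{a(x)^2 + \lVert b(x)\rVert^{4}}}{\lVert b(x)\rVert^{2}}\,
    b(x)^{\top}, & b(x) \neq 0, \\[4pt]
  0, & b(x) = 0.
\end{cases}
```

Substituting u_S into the derivative of V along trajectories gives \dot{V} = -\sqrt{a^2 + \lVert b\rVert^4} < 0, which is why the formula stabilizes for any valid CLF.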

The conventional methods, excluding the third approach, focus more on optimality than on stability. For example, to guarantee the stability of MPC, a terminal constraint or a terminal cost must be imposed. In the second approach, for the LgV-type optimal formula of the control-affine system to yield an asymptotically stabilizing policy, the value function must solve the Lyapunov equation associated with an asymptotically stabilizing policy. However, the Lyapunov equation is itself a PDE and is likewise difficult to solve. Even when a neural network is used to approximate the solution of the HJB equation, it is difficult to guarantee closed-loop stability with the LgV-type optimal formula while the network weights are being updated, especially in the early stages of learning.
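For concreteness, in the same assumed quadratic-in-u setting, the LgV-type optimal formula and the Lyapunov (policy-evaluation) equation referred to above read:

```latex
% Minimizing the Hamiltonian over u gives the LgV-type optimal formula:
u^*(x) = -\tfrac{1}{2} R^{-1} g(x)^{\top} \nabla V(x)
% Policy evaluation: for a fixed policy u(x), the value function V solves
% the (generalized) Lyapunov equation, itself a PDE:
0 = q(x) + u(x)^{\top} R\, u(x)
      + \nabla V(x)^{\top} \big( f(x) + g(x)\, u(x) \big)
```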

Although the conventional inverse control approach focuses on stability rather than optimality, the CLF must have the same shape of level sets as the optimal value function in order for Sontag's formula to provide the optimal controller for a user-defined cost function. However, only a few studies have developed algorithms to find such a CLF [1]. In [1], the CLF is learned by adjusting the cost function for which the CLF is the optimal value function, so as to achieve performance similar to that of the optimal controller minimizing the user-specified cost function. In our study, unlike the method of [1], we modify an algorithm developed within the second approach to obtain a new algorithm that learns a CLF whose level sets coincide with those of the optimal value function for the user-specified cost function.

In this study, we propose a new PI-based algorithm that learns both the optimal value function and the optimal controller for nonlinear control-affine systems while guaranteeing stability. We prove stability of the closed-loop system and convergence to the optimal controller when Lyapunov equations are solved for policy evaluation and Sontag's formula is used for policy improvement. Even when function approximation and the gradient descent method are used for policy evaluation, closed-loop stability is guaranteed throughout the learning process. Since Sontag's formula yields an asymptotically stabilizing controller from the neural network, which is constrained to be a CLF, our controller asymptotically stabilizes the system even in the presence of approximation errors.
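The proposed algorithm targets nonlinear systems and uses Sontag's formula for improvement, but the underlying PI structure (policy evaluation via a Lyapunov equation, followed by policy improvement) can be sketched in the linear-quadratic special case, where the Lyapunov PDE reduces to a matrix equation (Kleinman's algorithm). The matrices A, B, Q, R and the initial stabilizing gain below are illustrative assumptions, not taken from this work:

```python
import numpy as np
from scipy.linalg import solve_continuous_lyapunov, solve_continuous_are

# Linear-quadratic special case: x' = Ax + Bu, cost = ∫ xᵀQx + uᵀRu dt.
A = np.array([[0.0, 1.0], [-1.0, 2.0]])   # illustrative unstable plant
B = np.array([[0.0], [1.0]])
Q = np.eye(2)
R = np.array([[1.0]])

K = np.array([[0.0, 5.0]])  # initial gain chosen so A - BK is Hurwitz

for _ in range(20):
    Ak = A - B @ K
    # Policy evaluation: solve the Lyapunov equation
    #   Akᵀ P + P Ak + Q + Kᵀ R K = 0  for the value matrix P.
    P = solve_continuous_lyapunov(Ak.T, -(Q + K.T @ R @ K))
    # Policy improvement (the LgV-type formula in the linear case):
    K = np.linalg.solve(R, B.T @ P)

# The iterates converge to the solution of the algebraic Riccati equation.
P_are = solve_continuous_are(A, B, Q, R)
print(np.allclose(P, P_are))  # prints True
```

In the linear case each iterate remains stabilizing, which is the property the proposed nonlinear algorithm preserves by constraining the learned function to be a CLF.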

In contrast, the LgV-type optimal formula cannot guarantee stability, because the approximate function is not exactly equal to the optimal value function. The formula therefore does not yield the optimal controller, and being a CLF is not a sufficient condition for the LgV-type formula to produce a stabilizing controller. Through simulation, we show several cases in which the system becomes unstable during the training process when the LgV-type optimal formula is used for policy improvement.


[1] Rohrweck, H., Schwarzgruber, T., Re, L. (2015). Approximate optimal control by inverse CLF approach. IFAC-PapersOnLine, 48(11), 286-291.

[2] Vamvoudakis, K. G., Lewis, F. L. (2010). Online actor–critic algorithm to solve the continuous-time infinite horizon optimal control problem. Automatica, 46(5), 878-888.

[3] Kamalapurkar, R., Rosenfeld, J. A., Dixon, W. E. (2016). Efficient model-based reinforcement learning for approximate online optimal control. Automatica, 74, 247-258.

[4] Vrabie, D., Pastravanu, O., Abu-Khalaf, M., Lewis, F.L. (2009). Adaptive optimal control for continuous-time linear systems based on policy iteration. Automatica, 45(2), 477-484.

[5] Freeman, R. A., Kokotovic, P. V. (1996). Inverse optimality in robust stabilization. SIAM Journal on Control and Optimization, 34(4), 1365-1391.

[6] Primbs, J. A., Nevistić, V., Doyle, J. C. (1999). Nonlinear optimal control: a control Lyapunov function and receding horizon perspective. Asian Journal of Control, 1(1), 14-24.