(687d) Reinforcement (Q-) Learning and Adaptive Model Predictive Control: Conflict or Conflux | AIChE

Cai, Z. - Presenter, Carnegie Mellon University
Ydstie, B. E., Carnegie Mellon University
Adaptive and machine-learning controllers perform tracking and regulation tasks by tuning control parameters. In RL/Q-learning, the control function is adapted directly so that performance improves over time. In adaptive model predictive control (aMPC), a model is updated and the MPC controller is then updated indirectly by optimization.
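The contrast can be sketched on a scalar linear-quadratic problem (all numbers illustrative, not from the paper): the indirect route computes the gain from an identified model via the Riccati equation, while the direct route learns a quadratic Q-function from data, in the spirit of Bradtke-style policy iteration [4], and extracts the greedy gain. Here the plant parameters are used only to generate data; the direct learner sees only states, inputs, and costs.

```python
import numpy as np

# Scalar plant x+ = a*x + b*u with stage cost q*x**2 + r*u**2 (illustrative values).
a, b, q, r = 0.9, 0.5, 1.0, 1.0
rng = np.random.default_rng(1)

# --- Indirect route: given a model, compute the LQ gain via Riccati iteration.
def lq_gain(a, b, q, r, iters=500):
    p = q
    for _ in range(iters):
        p = q + a**2 * p - (a * b * p)**2 / (r + b**2 * p)
    return a * b * p / (r + b**2 * p)

# --- Direct route: policy iteration on a learned Q-function.
# Q(x, u) = h[0]*x**2 + 2*h[1]*x*u + h[2]*u**2; greedy gain k = h[1]/h[2].
def phi(x, u):
    return np.array([x**2, 2 * x * u, u**2])

k = 0.0                                      # initial stabilizing gain (|a| < 1)
for _ in range(10):                          # policy-iteration sweeps
    X, y = [], []
    for _ in range(50):                      # one-step exploratory rollouts
        x = rng.normal()
        u = -k * x + rng.normal()            # probing noise for identifiability
        xn = a * x + b * u
        X.append(phi(x, u) - phi(xn, -k * xn))   # Bellman-equation regressor
        y.append(q * x**2 + r * u**2)
    h, *_ = np.linalg.lstsq(np.array(X), np.array(y), rcond=None)
    k = h[1] / h[2]                          # policy improvement

print(k, lq_gain(a, b, q, r))                # the two routes agree
```

With persistent excitation the direct gain converges to the same fixed point as the model-based Riccati gain; the point of the paper is that this agreement hinges on the excitation, not on the learning rule alone.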

The purpose of the current paper is to compare and contrast the direct and indirect control methods using a linear system with a quadratic performance objective as an example. The system to be controlled and the control objective are then identical for the two approaches, and it is possible to compare and contrast them in simulation and by Lyapunov-type stability theory. In both cases we apply the same recursive least squares (RLS) algorithm to tune the control parameters. The control design methods are quite different, however, and the performance and stability properties of the two approaches also turn out to be quite different. To compare the methods, we say that a conflux exists if the parameter-tuning and control algorithms exhibit coinciding and unique fixed points. In that case we can guarantee optimality of the algorithm if it is stable and converges. Otherwise, tuning and control are said to be in conflict, and optimality cannot be guaranteed even if the algorithm converges.
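The RLS update shared by both approaches can be sketched as follows; this is a minimal scalar-plant example with illustrative parameters, not code from the paper.

```python
import numpy as np

def rls_update(theta, P, phi, y, lam=1.0):
    """One recursive-least-squares step for the model y ~ phi @ theta.

    theta : current parameter estimate, shape (n,)
    P     : covariance matrix, shape (n, n)
    phi   : regressor vector, shape (n,)
    y     : scalar measurement
    lam   : forgetting factor (1.0 = no forgetting)
    """
    Pphi = P @ phi
    gain = Pphi / (lam + phi @ Pphi)           # Kalman-style gain vector
    theta = theta + gain * (y - phi @ theta)   # prediction-error correction
    P = (P - np.outer(gain, Pphi)) / lam       # covariance update
    return theta, P

# Identify x+ = a*x + b*u from data (true values a=0.9, b=0.5, assumed here).
rng = np.random.default_rng(0)
a_true, b_true = 0.9, 0.5
theta, P = np.zeros(2), 1e3 * np.eye(2)
x = 0.0
for _ in range(200):
    u = rng.normal()                           # persistently exciting input
    x_next = a_true * x + b_true * u
    theta, P = rls_update(theta, P, np.array([x, u]), x_next)
    x = x_next
print(theta)                                   # close to [0.9, 0.5]
```

In the indirect (aMPC) case the regressor holds past states and inputs and theta holds model parameters; in the direct (Q-learning) case the same recursion is run on quadratic features of the state-input pair to estimate the Q-function parameters.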

Classical certainty-equivalence (CE) aMPC treats control and learning as separate tasks. This approach does not generate input signals rich enough for good parameter estimation: stability can be guaranteed, but optimality cannot unless additional excitation is provided. In this paper we show that the RL/Q-learning approach provides neither stability nor optimality unless the signals are excited. Moreover, the excitation requirements are more stringent and, to make matters worse, RL/Q-learning must use slow learning to remain stable. This requirement has no parallel in aMPC. Thus the same asymptotic stability and optimality properties hold for direct and indirect learning as long as excitation holds, but the convergence rate of RL/Q-learning is much slower.

Dual control theory [1] addresses the trade-off between two tasks: trajectory convergence and parameter convergence. A dual adaptive controller is designed to optimally improve learning while maintaining robust control performance. The ideas go back to the original work of Feldbaum [1]. However, algorithms and theory for application to non-trivial examples have been developed only recently [2].

In the current paper we explore the application of dual control as a vehicle to improve the performance of aMPC and RL/Q-learning. In particular, we address the issue Polderman [3] described as the “conflict or conflux” between estimation and control. In his seminal work, he showed that a conflux exists between estimation and control for minimum-variance and pole-placement control. However, it does not exist for infinite-horizon aMPC; instead there is conflict, resulting in stable but sub-optimal controls. Bradtke et al. [4] demonstrated a similar conflict between estimation and control in RL/Q-learning. The purpose of our presentation is to review these results briefly and to explore the possibility of using dual control to stabilize learning controllers and provide asymptotic optimality, thus resolving the conflicts that exist in both the direct and indirect approaches to learning control.


  1. Feldbaum, A. (1960). Dual control theory, I. Avtomatika i Telemekhanika, 21(9), 1240–1249.
  2. Heirung, T. A. N., Ydstie, B. E., & Foss, B. (2017). Dual adaptive model predictive control. Automatica, 80, 340–348.
  3. Polderman, J. W. (1989). Adaptive control & identification: Conflict or conflux? CWI Tracts, 67.
  4. Bradtke, S. J., Ydstie, B. E., & Barto, A. G. (1994). Adaptive linear quadratic control using policy iteration. In Proceedings of the 1994 American Control Conference (ACC '94), Vol. 3, pp. 3475–3479. IEEE.