(243d) Learning and Adapting Model Predictive Controllers with Reinforcement Learning for Time-Varying Systems
RL entails a Markov decision process (MDP) whereby an actor applies an input to a system and a critic determines a reward based on the new state2. The use of RL with MPC in the presence of parametric uncertainty has been addressed for linear systems3. To utilize evolving information about future uncertainty, an RL approach has been proposed that uses the learned cost-to-go as the terminal penalty in an MPC4. However, one of the critical issues that can lead to poor MPC performance is model discrepancy. In this work, we propose a novel RL algorithm for learning as well as adapting the MPC.
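The idea of embedding a learned cost-to-go as the MPC terminal penalty can be illustrated with a minimal sketch. The system, costs, and solution method below are illustrative assumptions, not the formulation of the cited works: a scalar linear system with quadratic stage cost, where the infinite-horizon cost-to-go p*x^2 is obtained by fixed-point iteration and then closes a short prediction horizon.

```python
import numpy as np

# Illustrative sketch (assumed scalar linear system, not the cited
# formulation): a learned cost-to-go V(x) = p*x^2 serves as the
# terminal penalty of a short-horizon MPC for x+ = a*x + b*u with
# stage cost q*x^2 + r*u^2.
a, b, q, r = 0.9, 0.5, 1.0, 0.1

# "Learn" the infinite-horizon cost-to-go coefficient p by fixed-point
# iteration on the scalar Riccati recursion.
p = q
for _ in range(200):
    p = q + a * p * a - (a * p * b) ** 2 / (r + b * p * b)

def mpc(x0, N=3):
    """Crude grid-enumeration MPC; terminal penalty is the learned p*x^2."""
    grid = np.linspace(-2.0, 2.0, 41)

    def rollout(x, depth):
        if depth == N:
            return p * x * x  # learned cost-to-go closes the horizon
        return min(q * x * x + r * u * u + rollout(a * x + b * u, depth + 1)
                   for u in grid)

    best_cost, best_u0 = np.inf, 0.0
    for u0 in grid:
        c = q * x0 * x0 + r * u0 * u0 + rollout(a * x0 + b * u0, 1)
        if c < best_cost:
            best_cost, best_u0 = c, u0
    return best_u0

u = mpc(1.0)
```

Because the terminal penalty equals the true cost-to-go here, the three-step MPC recovers (up to input quantization) the infinite-horizon optimal feedback; a mismatched terminal penalty is exactly where model discrepancy degrades performance.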
Q-learning, one of the RL methods, can be used to learn the state dynamics and the value function5. However, adapting the MPC model based on the Q-function is impractical for time-varying systems, since the space of Q-functions is infinite-dimensional in that setting. Here we use a BSS-ANOVA Gaussian process (GP) in which the eigenfunctions of the Karhunen-Loève (KL) expansion serve as the orthogonal basis functions. A key advantage of using a KL expansion with the GP model for the discrepancy function is that the stochasticity is represented entirely by the discrepancy parameters, since the basis functions of each functional component do not change with the covariance function parameters. This translates to reduced computational cost. Residual analysis of the Bellman optimality equation, along with policy gradient and actor-critic methods, is then used for model adaptation. The value functions and the policy, as a map of control actions, are stored in compact clusters by using a subtractive clustering technique for unsupervised learning of unique, or core, control features. Cores are automatically updated as new information is gathered. The algorithm also includes directed exploration methods that add an intrinsic reward to the original reward, ensuring that the infinite-horizon cost function converges to the exact cost-to-go function as the discrepancy vanishes. Feasibility and optimality conditions of the proposed algorithm are also analyzed.
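The computational advantage of fixed basis functions can be sketched as follows. This is an assumed, simplified stand-in, not the BSS-ANOVA implementation: Legendre polynomials play the role of the KL eigenfunctions, and because the basis is fixed, learning the discrepancy reduces to a linear least-squares solve for the coefficients.

```python
import numpy as np

# Illustrative sketch (assumed details, not the paper's BSS-ANOVA code):
# the model discrepancy d(x) is expanded on FIXED orthogonal basis
# functions, d(x) ~ sum_i theta_i * phi_i(x).  Only the coefficients
# theta are stochastic and learned, so updating the discrepancy model
# is a single linear solve.  Legendre polynomials on [-1, 1] stand in
# for the KL eigenfunctions here.
rng = np.random.default_rng(0)

def basis(x, n_terms=5):
    """Fixed orthogonal basis evaluated at points x (Legendre stand-in)."""
    return np.polynomial.legendre.legvander(x, n_terms - 1)

# Synthetic plant-vs-model residuals: a smooth true discrepancy plus noise.
x = np.linspace(-1.0, 1.0, 200)
true_disc = 0.3 * x + 0.2 * x**2
resid = true_disc + 0.01 * rng.standard_normal(x.size)

# Learning the discrepancy = one linear least-squares solve for theta.
Phi = basis(x)
theta, *_ = np.linalg.lstsq(Phi, resid, rcond=None)
disc_hat = Phi @ theta
err = np.max(np.abs(disc_hat - true_disc))
```

Since the basis never changes with the covariance-function parameters, the design matrix `Phi` can be precomputed once and reused at every adaptation step, which is the reduced-cost property noted above.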
The algorithm developed here is applied to the load-following problem in the operation of a supercritical pulverized coal (SCPC) power plant, where one of the critical control problems is main steam temperature control under load changes. Due to sliding pressure operation, the significant nonlinearity of steam properties over the operating domain, and evolving ash buildup on the tubes, this system is time-varying and nonlinear. The proposed RL-augmented MPC algorithm is evaluated using a high-fidelity dynamic model of the SCPC plant6.
S. J. Qin and T. A. Badgwell, "A survey of industrial model predictive control technology," Control Eng. Pract., vol. 11, no. 7, pp. 733–764, Jul. 2003.
T. A. Badgwell, J. H. Lee, and K.-H. Liu, "Reinforcement Learning – Overview of Recent Progress and Implications for Process Control," in Computer Aided Chemical Engineering, vol. 44, M. R. Eden, M. G. Ierapetritou, and G. P. Towler, Eds. Elsevier, 2018, pp. 71–85.
J. E. Morinelly and B. E. Ydstie, "Dual MPC with Reinforcement Learning," 11th IFAC Symp. Dyn. Control Process Syst. Biosyst. DYCOPS-CAB 2016, vol. 49, no. 7, pp. 266–271, Jan. 2016.
J. Lee and W. Wong, "Approximate Dynamic Programming Approach for Process Control," J. Process Control, vol. 20, pp. 1038–1048, 2010.
C. Watkins, "Learning from Delayed Rewards," Ph.D. dissertation, University of Cambridge, Cambridge, U.K., 1989.
P. Sarda, E. Hedrick, K. Reynolds, D. Bhattacharyya, E. S. Zitney, and B. Omell, "Development of a Dynamic Model and Control System for Load-Following Studies of Supercritical Pulverized Coal Power Plants," Processes, vol. 6, no. 11, 2018.