(243d) Learning and Adapting Model Predictive Controllers with Reinforcement Learning for Time-Varying Systems

Hedrick, E., West Virginia University
Reynolds, K., West Virginia University
Sarda, P., West Virginia University
Bhattacharyya, D., West Virginia University
Zitney, S. E., National Energy Technology Laboratory
Omell, B. P., National Energy Technology Laboratory
While model predictive control (MPC) has been a primary tool of the process control community [1], automatically learning models and transitioning to previously developed models, when available, is challenging for time-varying systems. The main challenge in such a transition lies in identifying how similar the current control problem is to one or more previous control problems. Here we propose a novel MPC augmented with reinforcement learning (RL) for learning and adapting.

RL entails a Markov decision process (MDP) in which an actor applies an input to a system and a critic determines a reward based on the new state [2]. The use of RL with MPC in the presence of parametric uncertainty has been addressed for a linear system [3]. To exploit evolving information about future uncertainty, an RL approach has been proposed that uses the learned cost-to-go as the terminal penalty in an MPC [4]. However, one of the critical issues that can lead to poor MPC performance is model discrepancy. In this work, we propose a novel RL algorithm for learning as well as adapting the MPC.
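As a rough illustration of the terminal-penalty idea cited above, the following sketch writes a deterministic, discounted Bellman optimality equation next to a finite-horizon MPC problem whose terminal term is a learned approximation of the cost-to-go. The symbols $\ell$ (stage cost), $f$ (state-transition model), $\gamma$ (discount factor), $N$ (horizon), and $\hat{V}$ (learned cost-to-go) are generic placeholders assumed here for illustration; they are not the authors' notation or formulation.

```latex
% Sketch only: \ell, f, \gamma, N, and \hat{V} are assumed placeholder symbols.
\begin{gather}
V^{*}(x) = \min_{u}\,\Bigl[\ell(x,u) + \gamma\,V^{*}\bigl(f(x,u)\bigr)\Bigr], \\
\min_{u_0,\dots,u_{N-1}}\; \sum_{k=0}^{N-1} \gamma^{k}\,\ell(x_k,u_k) + \gamma^{N}\,\hat{V}(x_N)
\quad \text{s.t.}\quad x_{k+1} = f(x_k,u_k),\quad x_0 = x(t).
\end{gather}
```

Under these assumptions, if $\hat{V}$ equals $V^{*}$, the finite-horizon problem has the same optimal cost as the infinite-horizon one, which is why a well-learned cost-to-go can serve as the terminal penalty.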

Q-learning, one of the RL methods, can be used to learn the state dynamics and the value function [5]. However, adapting the MPC model based on the Q-function is impractical for time-varying systems because the space of Q-functions is infinite-dimensional for such systems. Here we use a Bayesian smoothing spline ANOVA (BSS-ANOVA) Gaussian process (GP) in which the eigenfunctions of the Karhunen-Loève (KL) expansion serve as the orthogonal basis functions. A key advantage of using a KL expansion with the GP model for the discrepancy function is that the stochasticity is represented by the discrepancy parameters, since the basis functions for each functional component do not change with the covariance function parameters. This translates to reduced computational cost. Residual analysis of Bellman's optimality equation, along with policy gradient and actor-critic methods, is then used for model adaptation. The value functions and the policy, as a map of control actions, are stored in compact clusters by using a subtractive clustering technique for unsupervised learning of unique, or core, control features. Cores are automatically updated as new information is gathered. The algorithm also includes directed exploration methods that add an intrinsic reward to the original reward, ensuring that the infinite-horizon cost function converges to the exact cost-to-go function as the discrepancy vanishes. Feasibility and optimality conditions of the proposed algorithm are also analyzed.
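The subtractive clustering step mentioned above can be sketched as follows. This is a minimal, generic version of the density-based method, not the authors' implementation: the function name `subtractive_clustering`, its parameters (`radius`, `squash`, `stop_ratio`, `max_centers`), and the single-threshold stopping rule are assumptions made for illustration.

```python
import numpy as np


def subtractive_clustering(points, radius=0.5, squash=1.5,
                           stop_ratio=0.15, max_centers=50):
    """Return cluster centers chosen from the data points themselves.

    Hypothetical helper: a simplified subtractive clustering with one
    stopping threshold instead of separate accept/reject ratios.
    """
    x = np.asarray(points, dtype=float)

    # Scale each coordinate to [0, 1] so one radius applies to all dimensions.
    span = x.max(axis=0) - x.min(axis=0)
    span[span == 0.0] = 1.0
    z = (x - x.min(axis=0)) / span

    # Pairwise squared distances between all scaled points.
    sq_dist = ((z[:, None, :] - z[None, :, :]) ** 2).sum(axis=2)

    alpha = 4.0 / radius ** 2              # neighborhood used for the density
    beta = 4.0 / (squash * radius) ** 2    # wider neighborhood used to subtract

    # Initial potential: a density estimate at every data point.
    potential = np.exp(-alpha * sq_dist).sum(axis=1)
    first_peak = potential.max()

    centers = []
    while len(centers) < max_centers:
        k = int(np.argmax(potential))
        peak = potential[k]
        # Stop once the best remaining candidate is weak relative to the first.
        if peak < stop_ratio * first_peak:
            break
        centers.append(k)
        # Remove the influence of the new center from all potentials.
        potential = potential - peak * np.exp(-beta * sq_dist[:, k])

    return x[centers]


# Two well-separated synthetic groups of points stand in for stored features.
rng = np.random.default_rng(0)
group_a = rng.normal([0.0, 0.0], 0.05, size=(40, 2))
group_b = rng.normal([5.0, 5.0], 0.05, size=(40, 2))
centers = subtractive_clustering(np.vstack([group_a, group_b]))
```

Because every data point is a candidate center, the number of clusters does not need to be fixed in advance, which suits a setting where new cores appear as more operating data is gathered.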

The algorithm developed here is applied to the load-following problem in the operation of a supercritical pulverized coal (SCPC) power plant. One of the critical control problems in this setting is main steam temperature control under load changes. Because of sliding-pressure operation, the significant nonlinearity of steam properties in the operating domain, and evolving ash buildup on the tubes, this is a time-varying nonlinear system. The proposed RL-augmented MPC algorithm is evaluated using a high-fidelity dynamic model of the SCPC plant [6].


[1] S. J. Qin and T. A. Badgwell, “A survey of industrial model predictive control technology,” Control Eng. Pract., vol. 11, no. 7, pp. 733–764, Jul. 2003.

[2] T. A. Badgwell, J. H. Lee, and K.-H. Liu, “Reinforcement Learning – Overview of Recent Progress and Implications for Process Control,” in Computer Aided Chemical Engineering, vol. 44, M. R. Eden, M. G. Ierapetritou, and G. P. Towler, Eds. Elsevier, 2018, pp. 71–85.

[3] J. E. Morinelly and B. E. Ydstie, “Dual MPC with Reinforcement Learning,” 11th IFAC Symp. Dyn. Control Process Syst. Biosyst. DYCOPS-CAB 2016, vol. 49, no. 7, pp. 266–271, Jan. 2016.

[4] J. Lee and W. Wong, “Approximate Dynamic Programming Approach for Process Control,” J. Process Control, vol. 20, pp. 1038–1048, 2010.

[5] C. Watkins, “Learning from Delayed Rewards,” Ph.D. dissertation, King's College, University of Cambridge, Cambridge, U.K., 1989.

[6] P. Sarda, E. Hedrick, K. Reynolds, D. Bhattacharyya, S. E. Zitney, and B. Omell, “Development of a Dynamic Model and Control System for Load-Following Studies of Supercritical Pulverized Coal Power Plants,” Processes, vol. 6, no. 11, 2018.