
(299c) Model Predictive Control As a Reinforcement Learning Policy: Faster Learning Via Policy Rollouts

Authors 

Hedrick, E. - Presenter, West Virginia University
Hedrick, K., West Virginia University
Bhattacharyya, D., West Virginia University
Zitney, S., National Energy Technology Laboratory
Omell, B. P., National Energy Technology Laboratory
Reinforcement learning (RL) is a machine learning approach for automatic control that has seen significant research activity in recent years [1]. There is strong interest in applying RL synergistically with existing process control structures to alleviate the sample inefficiency and safety problems that arise when RL is used directly for process control [2]. The combination of RL with model predictive control (MPC) has received more research attention than combinations with other process controllers [3]–[6], and it is also the focus of this work. This talk presents an approach for directly combining RL and MPC by using the MPC as the control policy in an approximate RL formulation.

In the approach proposed in this work, RL is used to directly improve the value function used in the MPC. The resulting policy amounts to optimization of the learned action-value function, subject to the modeled dynamics, over a predicted trajectory. Because the controller follows the current policy over the predicted trajectory, this is an on-policy algorithm, and the SARSA update is used for the value function. The basis functions used in the approximation of the action-value function are radial basis functions in the states and control moves, multiplied by standard quadratic terms so that the resulting objective function is zero at the origin and positive elsewhere (guaranteeing stability in the linear case by ensuring that the weights learned by the RL agent remain positive definite). The algorithm is applied in two ways. The first uses the one-step update from the MPC optimization at each timestep. The second takes the optimal MPC trajectory as a policy rollout and computes an n-step return for updating the action-value function, where n can be any length up to the prediction horizon Np.
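To make the structure of the approximation and the one-step update concrete, the following is a minimal sketch (not the authors' implementation). The basis centers, width, stage cost, learning rate, and discount factor are illustrative assumptions, and the action-value function is treated as a cost-to-go for concreteness.

```python
import numpy as np

# Minimal sketch of the RBF-times-quadratic action-value approximation and
# the one-step SARSA update described above. Dimensions, centers, the
# learning rate, and the discount factor are illustrative assumptions.

def basis(x, u, centers, width):
    """phi_i(x, u) = exp(-||z - c_i||^2 / width) * ||z||^2 with z = [x; u],
    so every basis function is zero at the origin and positive elsewhere."""
    z = np.concatenate([x, u])
    rbf = np.exp(-np.sum((centers - z) ** 2, axis=1) / width)
    return rbf * np.dot(z, z)

def q_value(w, x, u, centers, width):
    """Linear-in-the-weights approximation Q(x, u) = w^T phi(x, u)."""
    return w @ basis(x, u, centers, width)

def sarsa_one_step(w, x, u, stage_cost, x_next, u_next, centers, width,
                   alpha=0.05, gamma=0.99):
    """On-policy (SARSA) update using the state/action pair actually visited
    at the next timestep. Weights are clipped to remain nonnegative, in the
    spirit of the positive-definiteness requirement noted above."""
    phi = basis(x, u, centers, width)
    td_error = (stage_cost
                + gamma * q_value(w, x_next, u_next, centers, width)
                - w @ phi)
    return np.maximum(w + alpha * td_error * phi, 0.0)
```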

These algorithms are applied to a benchmark process control example in which the states are assumed to be measurable. Learning experiments are carried out in structured episodes: each episode begins by injecting a disturbance into the plant and ends once the output has returned to the setpoint. Learning is carried out at each timestep, and the weights of the RL agent are initialized to unity (corresponding to a static MPC with Q=R=I). Both the one-step and the n-step approaches achieve superior performance to the static controller after learning (because actions are selected randomly during exploratory steps, results are reported over 20 trials in terms of median performance). With the n-step algorithm, faster learning can be achieved through the policy rollout, but performance does not improve monotonically with search depth; moderate values of n yield the fastest learning, in the sense of most quickly approaching the performance of the final policy. While this may not be intuitive, the degradation at larger values of n stems from errors in the action-value function propagating further into the update, especially during the early stages of learning. While both approaches improve performance over the static MPC, the algorithm using a moderate search depth improves the policy significantly faster than the one-step algorithm.
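To illustrate the role of the search depth n discussed above, the sketch below shows one plausible form of the n-step return computed along the MPC's predicted trajectory (the policy rollout); it reuses the hypothetical `q_value` function from the previous sketch, and the discount factor is again an assumption.

```python
def n_step_return(pred_costs, w, x_n, u_n, centers, width, gamma=0.99):
    """n-step return along the MPC's open-loop predicted trajectory: the
    discounted sum of the first n predicted stage costs plus the
    bootstrapped Q-value at the n-th predicted state/input pair.
    `pred_costs` holds the first n predicted stage costs, with n <= Np."""
    n = len(pred_costs)
    ret = sum(gamma ** k * c for k, c in enumerate(pred_costs))
    return ret + gamma ** n * q_value(w, x_n, u_n, centers, width)
```

In this sketch, the weight update simply replaces the one-step TD target with this n-step return; larger n folds more of the predicted trajectory into the target, which is why errors in the learned action-value function and model predictions propagate further into the update.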

Bibliography

[1] D. Silver et al., “Mastering the game of Go without human knowledge,” Nature, vol. 550, no. 7676, pp. 354–359, Oct. 2017, doi: 10.1038/nature24270.

[2] J. Shin, T. A. Badgwell, K. H. Liu, and J. H. Lee, “Reinforcement Learning – Overview of recent progress and implications for process control,” Comput. Chem. Eng., vol. 127, pp. 282–294, Aug. 2019, doi: 10.1016/j.compchemeng.2019.05.029.

[3] D. Görges, “Relations between Model Predictive Control and Reinforcement Learning,” IFAC-PapersOnLine, vol. 50, no. 1, pp. 4920–4928, Jul. 2017, doi: 10.1016/j.ifacol.2017.08.747.

[4] S. Gros and M. Zanon, “Data-driven economic NMPC using reinforcement learning,” IEEE Trans. Automat. Contr., vol. 65, no. 2, pp. 636–648, Feb. 2020, doi: 10.1109/TAC.2019.2913768.

[5] M. Zanon and S. Gros, “Safe Reinforcement Learning Using Robust MPC,” IEEE Trans. Automat. Contr., vol. 66, no. 8, pp. 3638–3652, Aug. 2021, doi: 10.1109/TAC.2020.3024161.

[6] E. Hedrick, K. Hedrick, D. Bhattacharyya, S. E. Zitney, and B. Omell, “Reinforcement learning for online adaptation of model predictive controllers: Application to a selective catalytic reduction unit,” Comput. Chem. Eng., vol. 160, p. 107727, Apr. 2022, doi: 10.1016/j.compchemeng.2022.107727.