(373z) Two-Stage Reinforcement Learning for Batch Bioprocess Optimization Under Uncertainty
It follows that we seek a strategy that can optimize a process while handling both the system's stochastic behaviour (e.g. process disturbances) and plant-model mismatch. In this work we use reinforcement learning (RL), and more specifically policy gradients, as an alternative to existing methods.
The chemical engineering community has dealt with stochastic biosystems for a long time. For example, nonlinear dynamic optimization, and nonlinear model predictive control (MPC) in particular, is a powerful methodology for uncertain dynamic systems; however, several properties make its application less attractive. Most MPC approaches require a detailed model of the system's dynamics, and stochastic MPC additionally requires assumptions about how the uncertainty is quantified and propagated. Furthermore, conventional MPC assumes open-loop control actions at future time points in the prediction, which can lead to overly conservative control actions.
In contrast, RL directly accounts for the effect of future uncertainty and its feedback in a proper "closed-loop" manner. In addition, policy gradients can establish a policy in a model-free fashion and are computationally cheap online: the online computations require only the evaluation of a policy, since the computational cost is shifted offline.
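For reference, the standard score-function (REINFORCE) form of the policy gradient is sketched below. The notation is an assumption introduced here and does not appear in the abstract: \pi_\theta is the stochastic policy with parameters \theta, x_t and u_t are the state and control at time t, R(\tau) is the total reward of an episode \tau of length T, K is the number of episodes per epoch, and b is a baseline used to reduce variance.

```latex
\begin{align}
J(\theta) &= \mathbb{E}_{\tau \sim p_\theta(\tau)}\left[ R(\tau) \right], \\
\nabla_\theta J(\theta) &= \mathbb{E}_{\tau \sim p_\theta(\tau)}\left[ R(\tau) \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(u_t \mid x_t) \right] \\
&\approx \frac{1}{K} \sum_{k=1}^{K} \left( R(\tau^{(k)}) - b \right) \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta\!\left(u_t^{(k)} \mid x_t^{(k)}\right).
\end{align}
```

The expectation is taken over closed-loop trajectories, so the effect of future disturbances and of the feedback policy enters the gradient directly. No process model appears in the estimator; only sampled episodes and the gradient of the log-policy are needed.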
In this work we propose a two-stage reinforcement learning strategy. We assume that a process model is available, which is exploited to obtain a preliminary optimal control policy. The policy is trained offline with reinforcement learning over a large number of epochs and episodes, so that most of the computational effort is incurred before deployment. The policy is a recurrent neural network that receives a window of past states and control actions together with the current state, and outputs the stochastic policy from which the control action is drawn.
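The structure of such a policy can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the work itself: the class name `RecurrentGaussianPolicy`, the Elman-style cell, the Gaussian output, a scalar control, the hidden size, and the zero placeholder used for the not-yet-chosen control at the current time. The weights are random and untrained.

```python
import math
import random


class RecurrentGaussianPolicy:
    """Tiny Elman-style RNN mapping a history window to a Gaussian over the control."""

    def __init__(self, n_state, n_hidden=8, seed=0):
        rng = random.Random(seed)
        n_in = n_state + 1  # state vector plus one scalar control

        def init(rows, cols):
            return [[rng.gauss(0.0, 0.3) for _ in range(cols)] for _ in range(rows)]

        self.n_hidden = n_hidden
        self.w_in = init(n_hidden, n_in)        # input-to-hidden weights
        self.w_rec = init(n_hidden, n_hidden)   # hidden-to-hidden weights
        self.w_mean = init(1, n_hidden)[0]      # hidden-to-mean weights
        self.w_logstd = init(1, n_hidden)[0]    # hidden-to-log-std weights

    def _step(self, h, inp):
        # One recurrent update: h_new = tanh(W_in * inp + W_rec * h)
        return [
            math.tanh(
                sum(w * v for w, v in zip(self.w_in[j], inp))
                + sum(w * hv for w, hv in zip(self.w_rec[j], h))
            )
            for j in range(self.n_hidden)
        ]

    def distribution(self, past_states, past_controls, current_state):
        """Return (mean, std) of the Gaussian policy for the current time step."""
        h = [0.0] * self.n_hidden
        for x, u in zip(past_states, past_controls):
            h = self._step(h, list(x) + [u])
        # Current state enters last; its control is unknown, so use a placeholder.
        h = self._step(h, list(current_state) + [0.0])
        mean = sum(w * hv for w, hv in zip(self.w_mean, h))
        log_std = sum(w * hv for w, hv in zip(self.w_logstd, h))
        return mean, math.exp(log_std)  # exp keeps the standard deviation positive

    def sample(self, past_states, past_controls, current_state, rng):
        """Draw the control action from the stochastic policy."""
        mean, std = self.distribution(past_states, past_controls, current_state)
        return rng.gauss(mean, std)


# Usage: a window of two past (state, control) pairs plus the current state.
policy = RecurrentGaussianPolicy(n_state=2)
past_x = [(1.0, 0.2), (0.9, 0.3)]
past_u = [0.5, 0.4]
x_now = (0.8, 0.35)
u = policy.sample(past_x, past_u, x_now, random.Random(1))
```

Because the network outputs a distribution rather than a single value, the log-probability of the sampled control is available for a policy-gradient update, and the same forward pass is all that is needed online.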
Subsequently, during the online optimization stage, the policy network uses elements of transfer learning to adapt to and optimize the true system (the plant). The approach is verified in a series of case studies, including stochastic differential equation systems with complex dynamics.
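The two stages can be illustrated on a toy problem. The sketch below is not the method of this work: it assumes a one-step scalar process, a linear Gaussian policy instead of a recurrent network, and it reduces "transfer learning" to warm-starting the online stage from the offline parameters (the abstract does not say which transfer-learning elements are used). The function `reinforce`, the gains `MODEL_GAIN` and `PLANT_GAIN`, and all numerical settings are hypothetical.

```python
import random


def reinforce(theta, gain, iterations, rng, batch=64, lr=0.05, sigma=0.3):
    """Policy-gradient updates of u ~ N(theta * x, sigma^2) on a process with input gain `gain`."""
    for _ in range(iterations):
        rewards, scores = [], []
        for _ in range(batch):
            x = rng.uniform(0.5, 1.5)          # measured state
            u = rng.gauss(theta * x, sigma)    # control drawn from the stochastic policy
            w = rng.gauss(0.0, 0.1)            # process disturbance
            rewards.append(-(x + gain * u + w) ** 2)          # drive the next state to zero
            scores.append((u - theta * x) * x / sigma ** 2)   # d log pi(u|x) / d theta
        baseline = sum(rewards) / batch        # batch-mean baseline for variance reduction
        theta += lr * sum((r - baseline) * s for r, s in zip(rewards, scores)) / batch
    return theta


rng = random.Random(0)
MODEL_GAIN, PLANT_GAIN = 1.0, 0.8   # hypothetical plant-model mismatch

# Stage 1 (offline): many cheap simulated episodes on the process model.
theta_offline = reinforce(0.0, MODEL_GAIN, iterations=300, rng=rng)

# Stage 2 (online): a small number of updates on the plant, warm-started from stage 1.
theta_adapted = reinforce(theta_offline, PLANT_GAIN, iterations=20, rng=rng)

# Reference: the same small plant budget, but starting from an untrained policy.
theta_scratch = reinforce(0.0, PLANT_GAIN, iterations=20, rng=rng)
```

In this toy setting the best gain is -1/PLANT_GAIN = -1.25 on the plant and -1 on the model. The warm-started policy begins close to the plant optimum, so the same online budget brings it nearer to -1.25 than training from scratch.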
R. Sutton, A. Barto, 2018. Reinforcement Learning: An Introduction, Second Edition. MIT Press.
J. H. Lee, J. M. Lee, 2006. Approximate dynamic programming based approach to process control and scheduling. Computers & Chemical Engineering 30 (10-12), 1603–1618.
A. Krizhevsky, I. Sutskever, G. E. Hinton, 2012. ImageNet Classification with Deep Convolutional Neural Networks. In: F. Pereira, C. J. C. Burges, L. Bottou, K. Q. Weinberger (Eds.), Advances in Neural Information Processing Systems 25. Curran Associates, Inc., pp. 1097–1105.