(373z) Two-Stage Reinforcement Learning for Batch Bioprocess Optimization Under Uncertainty

Petsagkourakis, P., University College London
del Rio Chanona, E. A., Imperial College London
Zhang, D., University of Manchester
Bradford, E., NTNU

Bioprocesses have received considerable attention as a route to clean and sustainable alternatives to fossil-based materials. However, they are generally difficult to optimize because of their unsteady-state operation and stochastic behaviour. Furthermore, biological systems are highly complex, so plant-model mismatch is often present. Bioprocess optimization therefore faces three difficulties: 1) no precise model of the process is known (plant-model mismatch), which leads to inaccurate predictions and convergence to suboptimal solutions; 2) the process is subject to disturbances; and 3) the system is risk-sensitive, so exploration on the plant is undesirable.

It follows that we seek a strategy that can optimize a process while handling both the system’s stochastic behaviour (e.g. process disturbances) and plant-model mismatch. In this work we use reinforcement learning (RL), and more specifically policy gradients [1], as an alternative to existing methods.

The chemical engineering community has been dealing with stochastic biosystems for a long time. For example, nonlinear dynamic optimization, and nonlinear model predictive control (MPC) in particular, is a powerful methodology for uncertain dynamic systems; however, several properties make its application less attractive. Most MPC approaches require a detailed model of the system’s dynamics, and stochastic MPC additionally requires assumptions about how the uncertainty is quantified and propagated. Furthermore, conventional MPC assumes open-loop control actions at future time points in the prediction, which can lead to overly conservative control actions.

In contrast, RL directly accounts for the effect of future uncertainty and its feedback in a proper ‘closed-loop’ manner [2]. In addition, policy gradients can establish a policy in a model-free fashion and are cheap to run on-line: the on-line computation requires only the evaluation of a policy, since the computational cost of training is shifted off-line.
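As an illustration of the model-free idea, the following is a minimal sketch of a score-function (REINFORCE-style) policy-gradient update, not the implementation used in this work. The one-step `toy_plant`, the linear Gaussian policy, and all hyperparameters are assumptions chosen only to keep the example small; the learner sees sampled rewards and never the plant equations.

```python
import numpy as np

def reinforce_gradient(theta, features, actions, rewards, sigma):
    """Score-function gradient estimate for a ~ N(features @ theta, sigma^2)."""
    mean = features @ theta
    # d/dtheta log pi(a|s) for a Gaussian policy with a linear mean
    score = ((actions - mean) / sigma**2)[:, None] * features
    # Subtracting the batch-mean reward is a simple variance-reducing baseline
    advantage = rewards - rewards.mean()
    return (advantage[:, None] * score).mean(axis=0)

def toy_plant(s, a, rng):
    """Hypothetical one-step process: reward peaks near a = 1.5 * s,
    shifted by a random disturbance (stand-in for process noise)."""
    disturbance = rng.normal(0.0, 0.1, size=s.shape)
    return -(a - 1.5 * s - disturbance) ** 2

def train_policy(n_iters=500, batch=256, sigma=0.5, lr=0.05, seed=0):
    rng = np.random.default_rng(seed)
    theta = np.zeros(2)  # [gain on s, bias]
    for _ in range(n_iters):
        s = rng.uniform(-1.0, 1.0, size=batch)
        features = np.column_stack([s, np.ones(batch)])
        # Sample control actions from the stochastic policy
        a = features @ theta + sigma * rng.standard_normal(batch)
        # Only rewards are observed; no model gradients are used
        r = toy_plant(s, a, rng)
        # Gradient ascent on the expected reward
        theta = theta + lr * reinforce_gradient(theta, features, a, r, sigma)
    return theta

theta = train_policy()
print(theta)  # in this toy setup the gain should move toward 1.5, the bias toward 0
```

Once trained, applying the policy on-line costs a single matrix-vector product per decision, which is the sense in which the computational effort is shifted off-line.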

In this work we propose a two-stage reinforcement learning strategy. We assume that a process model is available, which is exploited to obtain a preliminary optimal control policy. Reinforcement learning is used to train the policy off-line for a large number of epochs and episodes, shifting most of the computational effort off-line. The policy is represented by a recurrent neural network that receives a window of past states and control actions together with the current state, and outputs the stochastic policy from which the control action is drawn.
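The structure of such a policy can be sketched as follows. This is a hedged illustration only: the simple tanh recurrent cell, the Gaussian output parameterization, the layer sizes, and the class name `RecurrentGaussianPolicy` are assumptions, and the weights are random rather than trained.

```python
import numpy as np

class RecurrentGaussianPolicy:
    """Hypothetical recurrent policy: maps a window of past (state, action)
    pairs plus the current state to a Gaussian over the control action."""

    def __init__(self, n_state, n_action, n_hidden=16, seed=0):
        self.rng = np.random.default_rng(seed)
        init = lambda *shape: self.rng.normal(0.0, 0.1, size=shape)
        self.W_past = init(n_hidden, n_state + n_action)  # past (x, u) pairs
        self.W_now = init(n_hidden, n_state)              # current state
        self.W_h = init(n_hidden, n_hidden)               # recurrent weights
        self.W_mu = init(n_action, n_hidden)              # mean head
        self.W_logstd = init(n_action, n_hidden)          # log-std head

    def distribution(self, past_states, past_actions, current_state):
        h = np.zeros(self.W_h.shape[0])
        # Roll the hidden state through the window of past data
        for x, u in zip(past_states, past_actions):
            h = np.tanh(self.W_past @ np.concatenate([x, u]) + self.W_h @ h)
        # Final step uses the current state, for which no action exists yet
        h = np.tanh(self.W_now @ current_state + self.W_h @ h)
        mean = self.W_mu @ h
        # Clip the log-std so the standard deviation stays in a sane range
        std = np.exp(np.clip(self.W_logstd @ h, -5.0, 2.0))
        return mean, std

    def sample(self, past_states, past_actions, current_state):
        mean, std = self.distribution(past_states, past_actions, current_state)
        # The control action is drawn from the stochastic policy
        return mean + std * self.rng.standard_normal(mean.shape)

policy = RecurrentGaussianPolicy(n_state=3, n_action=2)
past_x = [np.full(3, 0.1 * k) for k in range(4)]
past_u = [np.zeros(2) for _ in range(4)]
u = policy.sample(past_x, past_u, np.ones(3))
```

Feeding a window of past data lets the hidden state summarize information that the current state alone may not reveal, for example the effect of an unmeasured disturbance.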

Subsequently, during the on-line optimization stage, the policy network adapts to and optimizes the true system (the plant) using elements of transfer learning [3]. The approach is verified in a series of case studies, including stochastic differential equation systems with complex dynamics.
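One common form of transfer learning is to freeze most of a pretrained network and update only a few layers on new data. The sketch below shows that idea under stated assumptions: the abstract does not specify which layers are adapted, and the parameter names, shapes, and gradient values here are hypothetical placeholders.

```python
import numpy as np

def fine_tune_step(params, grads, trainable, lr=0.01):
    """One gradient-ascent step that updates only the layers in `trainable`."""
    updated = {}
    for name, value in params.items():
        if name in trainable:
            updated[name] = value + lr * grads[name]  # adapt to plant data
        else:
            updated[name] = value.copy()              # keep off-line knowledge frozen
    return updated

# Hypothetical parameters obtained from off-line training on the process model
pretrained = {"recurrent": np.ones((4, 4)), "output": np.ones((1, 4))}
# Hypothetical policy-gradient estimate from one batch run on the plant
plant_grads = {"recurrent": np.full((4, 4), 0.5), "output": np.full((1, 4), 0.5)}

# Only the output layer is adapted in this example
adapted = fine_tune_step(pretrained, plant_grads, trainable={"output"})
```

Restricting the update to a few parameters limits how far the policy can move after each plant run, which suits a risk-sensitive system where few real batches are available.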

[1] R. Sutton, A. Barto, 2018. Reinforcement Learning: An Introduction, Second Edition. MIT Press.

[2] J. H. Lee, J. M. Lee, 2006. Approximate dynamic programming based approach to process control and scheduling. Computers & Chemical Engineering 30 (10-12), 1603–1618.

[3] A. Krizhevsky, I. Sutskever, G. E. Hinton, 2012. ImageNet Classification with Deep Convolutional Neural Networks. In: F. Pereira, C. J. C. Burges, L. Bottou, K. Q. Weinberger (Eds.), Advances in Neural Information Processing Systems 25. Curran Associates, Inc., pp. 1097–1105.