
(659e) Integration of Reinforcement Learning and Model Predictive Control to Optimize Substrate Feeding Strategy of Semi-Batch Bioreactor

Authors 

Oh, T. H. - Presenter, Seoul National University
Lee, J. M., Seoul National University
Research Interests:

My Ph.D. research has focused on process modeling, control, and optimization. I have experience participating in team projects on controlling the simulated moving bed process, optimizing the penicillin production process, and solving the energy distribution problem in hybrid vehicles. I am interested in developing algorithms for model predictive control, especially the stochastic case, and for reinforcement learning.

Abstract:

As the digital transformation of manufacturing processes progresses, several studies have proposed applying model-free reinforcement learning (RL), one of the machine learning methods, to overcome model-plant mismatch in processes for which it is difficult to build an accurate model. For example, model-free RL algorithms such as the double deep Q-network (DDQN), deep deterministic policy gradient (DDPG), actor-critic, and policy gradient algorithms have been applied to production scheduling, real-time optimization, and optimal control problems in processes such as simulated moving beds, microfluidic systems, textile chemical processes, power generation plants, polymerization processes, polishing processes, and bioprocesses.

Obtaining the optimal substrate feeding strategy of a bioprocess, which is closely tied to its operating cost, is one of the major challenges in chemical engineering. In general, the model-based optimal control problem is solved in two stages: an open-loop optimization is performed at the upper level to obtain the optimal state trajectories, and an on-line feedback controller is then used at the lower level to track those trajectories (a minimal sketch of this two-stage structure is given after the list below). However, bioprocess models exhibit highly stochastic and nonlinear behavior, because the system dynamics are governed by complex biochemical reactions interacting with various metabolites in the cell. The low accuracy of the model undermines the performance of model-based optimal control, so model-free RL can be an alternative for obtaining the substrate feeding strategy. However, because model-free RL algorithms completely exclude the use of a model, several difficulties are expected when they are applied directly without modification:

  • The amount of data required for learning may be infeasible to obtain.
  • State constraints cannot be imposed explicitly.
  • Learning is sensitive to hyperparameters.
  • The learning procedure must be repeated whenever the cost (reward) is changed.
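
For reference, the conventional two-stage approach described before the list can be sketched roughly as follows; the nominal model f, stage cost L, terminal cost phi, horizon lengths T and N, and tracking weights W_x and W_u are generic placeholders assumed here, not quantities taken from the abstract. The upper level solves an open-loop problem over the whole batch,

$$
\{x_k^{\mathrm{ref}}, u_k^{\mathrm{ref}}\}_{k=0}^{T} \;=\; \arg\min_{u_0,\dots,u_{T-1}} \; \sum_{k=0}^{T-1} L(x_k, u_k) + \phi(x_T)
\quad \text{s.t.}\quad x_{k+1} = f(x_k, u_k),
$$

and the lower level tracks the resulting trajectories on-line from the current time t:

$$
\min_{u_t,\dots,u_{t+N-1}} \; \sum_{k=t}^{t+N-1} \lVert x_k - x_k^{\mathrm{ref}} \rVert_{W_x}^2 + \lVert u_k - u_k^{\mathrm{ref}} \rVert_{W_u}^2
\quad \text{s.t.}\quad x_{k+1} = f(x_k, u_k).
$$

Both levels rely entirely on the nominal model f, which is why model-plant mismatch degrades their performance.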

The lack of available data is the most crucial problem. Obtaining the data from a single batch operation takes far more time and cost than in the examples typically tackled in the computer science field. Therefore, it is crucial to use an algorithm that can improve the control policy with a limited amount of data.

In this work, we propose an algorithm integrating model-free RL and model predictive control (MPC) that can improve the control policy with less data than conventional model-free RL algorithms. Similar to MPC, the proposed algorithm adopts the receding horizon principle: it solves an optimal control problem at each time step and implements the control input corresponding to the current time step. However, the action-value function, which is learned from plant data, is assigned as the terminal cost. In this way, adaptation to the true system dynamics can be achieved without modifying the model. The action-value function itself can be learned with the conventional DDQN algorithm, since DDQN is an off-policy algorithm. The specific optimal control problem is presented in Figure 1.
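
Since Figure 1 is not reproduced in this abstract, the optimal control problem presumably takes a form along the following lines, where f is the nominal model, L the stage cost, Q the action-value function learned from plant data, and N the prediction horizon (all notation is assumed here rather than taken from Figure 1):

$$
\min_{u_0,\dots,u_N} \; \sum_{k=0}^{N-1} L(x_k, u_k) \;+\; Q(x_N, u_N)
\quad \text{s.t.}\quad x_{k+1} = f(x_k, u_k),\;\; x_0 = x(t),\;\; x_k \in \mathcal{X},\;\; u_k \in \mathcal{U},
$$

with only the first input u_0 applied to the plant before the problem is re-solved at the next sampling time.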

The proposed method is a generalization of DDQN and MPC. The algorithm becomes equivalent to DDQN in the continuous action domain when the prediction horizon length is set to 0, and equivalent to MPC when the prediction horizon spans the entire batch. The prediction horizon length therefore serves as a tuning parameter that determines how much the model is involved in computing the control input. Because the proposed method is an off-policy algorithm that uses the model, its data sample efficiency is significantly higher than that of conventional model-free RL algorithms. In addition, the proposed method is much less sensitive to hyperparameters, and it can explicitly impose state constraints.
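
To make the role of the prediction horizon concrete, the sketch below sets up this receding-horizon problem in Python with a learned action-value terminal cost. The one-state toy model, stage cost, Q-function, and the proposed_control helper are hypothetical stand-ins rather than the formulation from the paper; the sketch only illustrates that a horizon of 0 collapses to greedy minimization of Q (the DDQN-like limit), while a long horizon recovers a conventional MPC that relies fully on the model.

```python
import numpy as np
from scipy.optimize import minimize

# Illustrative stand-ins (not the paper's model, cost, or Q-network).
def model_step(x, u, dt=0.1):
    """Toy one-state nominal model: biomass-like state driven by feed rate u."""
    return x + dt * (0.5 * x * u - 0.05 * x)

def stage_cost(x, u):
    """Toy stage cost: penalize substrate usage, reward the state."""
    return 0.1 * u ** 2 - 0.01 * x

def q_value(x, u):
    """Stand-in for the learned action-value function (cost-to-go) used as terminal cost."""
    return 0.1 * u ** 2 - 0.5 * x

def proposed_control(x0, horizon, u_max=1.0):
    """Receding-horizon input with the learned Q as terminal cost.

    horizon = 0  -> greedy minimization of Q (DDQN-like limit in a continuous action space)
    horizon = T  -> conventional MPC over the remaining batch (model fully relied upon)
    State constraints would enter as additional inequality constraints; omitted for brevity.
    """
    n_inputs = horizon + 1  # u_0..u_{N-1} enter the stage costs, u_N enters Q

    def objective(u_seq):
        x, total = x0, 0.0
        for u in u_seq[:-1]:                  # roll the nominal model forward
            total += stage_cost(x, u)
            x = model_step(x, u)
        return total + q_value(x, u_seq[-1])  # learned terminal cost

    res = minimize(objective, np.full(n_inputs, 0.5),
                   bounds=[(0.0, u_max)] * n_inputs, method="L-BFGS-B")
    return res.x[0]  # implement only the first input, then re-solve at the next step

print(proposed_control(x0=1.0, horizon=0))  # DDQN-like limit
print(proposed_control(x0=1.0, horizon=5))  # blend of nominal model and learned value
```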

For the simulation study, the proposed method is applied to a penicillin production semi-batch bioprocess whose system dynamics are structurally different from the lumped model used in the model-based optimal control method. For comparison, DDQN, DDPG, and DDP are applied. The simulation results suggest that the proposed method improves the control policy with less data than DDQN and DDPG. In addition, the proposed method outperforms DDP, as it adapts to the system dynamics and minimizes the effect of model-plant mismatch.