(138b) Applying Reinforcement Learning for Batch Trajectory Optimization in an Industrial Chemical Process | AIChE

Rendall, R. - Presenter, University of Coimbra
Ma, Y., Louisiana State University
Castillo, I., Dow Inc.
Wang, Z., Dow Inc.
Chiang, L., Dow Inc.
Bentley, D., The Dow Inc.
Peng, Y., The Dow Chemical Co
Reinforcement Learning (RL) is one of the three basic machine learning paradigms, alongside supervised and unsupervised learning. RL focuses on training an agent to learn an optimal policy that maximizes the cumulative reward it collects from the environment of interest [1]. Recent developments in model-free RL have achieved remarkable success in various process optimization and control tasks, with applications reported in the literature including parameter tuning for existing single PID control loops [2], supply chain management [3] and robotics operations [4].
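
For reference, the objective of maximizing cumulative reward is commonly written as an expected return. The form below is the standard textbook statement, not a formula taken from this work; the discount factor $\gamma$ and the finite horizon $T$ (here, the batch length) are assumptions made for illustration.

$$
J(\pi) = \mathbb{E}_{\tau \sim \pi}\left[\sum_{t=0}^{T} \gamma^{t}\, r(s_t, a_t)\right],
\qquad
\pi^{*} = \arg\max_{\pi} J(\pi)
$$

Here $s_t$ is the process state, $a_t$ the action chosen by the policy $\pi$, and $r(s_t, a_t)$ the reward received at step $t$.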

There are multiple challenges when applying RL in an industrial setting, but the main one concerns the training of the agent. In the learning phase, the agent estimates and improves its policy and value functions through a large number of trial-and-error iterations. This requires many input-output experiments, which is not feasible in an operating industrial chemical plant. As an alternative, a model of the plant can be used to train the agent and to provide the input-output data. Both first-principles and data-driven models are suitable, and both options are explored in this work.
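
To illustrate how a plant model can stand in for the real process during training, the following is a minimal sketch of a simulated environment with a reset/step interface. Everything in it is hypothetical: the class `ToyBatchReactorEnv`, the first-order kinetics, the Arrhenius parameters and the action limits are made-up placeholders, not the first-principles or data-driven models used in this work.

```python
import math


class ToyBatchReactorEnv:
    """Hypothetical plant model used as an RL training environment.

    State: (conversion, temperature in K).
    Action: requested temperature change per step, in K.
    """

    def __init__(self, n_steps=20, dt=1.0):
        self.n_steps = n_steps
        self.dt = dt
        self.reset()

    def reset(self):
        """Start a new batch and return the initial state."""
        self.t = 0
        self.conversion = 0.0
        self.temperature = 320.0
        return (self.conversion, self.temperature)

    def step(self, delta_temp):
        """Advance the model by one time step and return (state, reward, done)."""
        # Limit the action to an assumed actuator range of +/- 5 K per step.
        delta_temp = max(-5.0, min(5.0, delta_temp))
        self.temperature += delta_temp

        # Arrhenius-type rate constant with made-up parameters.
        k = 1.0e3 * math.exp(-3000.0 / self.temperature)

        # First-order kinetics dX/dt = k (1 - X), integrated exactly over dt.
        previous = self.conversion
        self.conversion = 1.0 - (1.0 - previous) * math.exp(-k * self.dt)

        self.t += 1
        done = self.t >= self.n_steps
        reward = self.conversion - previous  # placeholder reward: conversion gained
        return (self.conversion, self.temperature), reward, done


def rollout(env, policy):
    """Run one simulated batch and return (total reward, list of transitions)."""
    state = env.reset()
    total, transitions, done = 0.0, [], False
    while not done:
        action = policy(state)
        next_state, reward, done = env.step(action)
        transitions.append((state, action, reward, next_state))
        total += reward
        state = next_state
    return total, transitions


# Example: a trivial policy that holds the temperature constant.
total_reward, data = rollout(ToyBatchReactorEnv(), lambda state: 0.0)
```

The agent only ever interacts with `reset` and `step`, so the same training loop could in principle be pointed at a first-principles simulator or at a data-driven surrogate without changes to the learning code.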

In this work, we test three state-of-the-art RL approaches to optimize an industrial batch case study: Proximal Policy Optimization (PPO), Soft Actor Critic (SAC) and Advantage Actor Critic (A2C). These RL methods optimize the batch process by controlling the reaction conditions so as to maximize the total reward, which is defined as the profit margin subject to certain process and safety constraints. The optimal batch trajectories are compared in two scenarios. In the first scenario, a first-principles model serves as the environment for training the agent. In the second scenario, a surrogate Long Short-Term Memory (LSTM) model is used, which combines historical data from the reactor's operation with the first-principles model estimates. The LSTM is motivated by its ability to mitigate accuracy issues in the first-principles model by relying on the relationships found in the plant data. The optimized trajectories were compared to the current trajectories, and the RL optimal batch profiles show a 3% increase in product profit.
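
As an illustration of a reward defined as a profit margin subject to constraints, the sketch below uses a soft penalty for constraint violation. This is an assumption: the abstract does not state how the constraints are enforced, and the function name `batch_reward`, its arguments, the single temperature limit and the penalty weight are all hypothetical.

```python
def batch_reward(product_kg, price_per_kg, operating_cost,
                 max_temp, temp_limit, penalty_weight=1000.0):
    """Profit margin of one batch minus a penalty for exceeding a temperature limit.

    All quantities are illustrative; a real process would likely have
    several process and safety constraints rather than one.
    """
    margin = product_kg * price_per_kg - operating_cost
    # Violation is zero while the peak temperature stays within the limit.
    violation = max(0.0, max_temp - temp_limit)
    return margin - penalty_weight * violation


# A batch that respects the limit keeps its full margin,
# while a batch that overshoots by 5 K is heavily penalized.
safe = batch_reward(100.0, 2.0, 50.0, max_temp=350.0, temp_limit=360.0)
unsafe = batch_reward(100.0, 2.0, 50.0, max_temp=365.0, temp_limit=360.0)
```

With a large enough penalty weight, trajectories that violate the limit score worse than any feasible trajectory, which steers the agent toward feasible operation during training.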


  [1] Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., & Riedmiller, M. (2013). Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602.
  [2] Badgwell, T. A., Liu, K. H., Subrahmanya, N. A., & Kovalski, M. H. (2019). U.S. Patent Application No. 16/218,650.
  [3] Gokhale, A., Trasikar, C., Shah, A., Hegde, A., & Naik, S. R. (2021). A Reinforcement Learning Approach to Inventory Management. In Advances in Artificial Intelligence and Data Engineering (pp. 281-297). Springer, Singapore.
  [4] Haarnoja, T., Zhou, A., Abbeel, P., & Levine, S. (2018). Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. arXiv preprint arXiv:1801.01290.