(371aa) Disruptive Artificial Intelligence (Reinforcement Learning) Based Predictive Control

Srinivas, S., TCS Research
Masampally, V., TCS Research
Runkana, V., TCS Research
This work examines how a given set of process operating conditions, varied over a defined neighborhood, affects the production objective, and determines the optimized values of those operating conditions needed to achieve the desired production target. A cognitive Artificial Intelligence optimal control strategy, the Markov decision process (MDP, a discrete-time stochastic control framework), is used to solve the complex sequential decision-making problem. This is in turn posed as an optimization problem and solved through Reinforcement Learning (RL), an approach to automating goal-directed learning and decision-making. RL is a class of machine learning algorithms that learn to accomplish a complex objective (goal). An autonomous RL agent interacts with its environment (posed as an MDP) and collects rewards to learn optimal behavior. The algorithms leverage emerging state-of-the-art supervised deep learning neural network architectures for function approximation. RL approaches are divided into policy optimization and dynamic programming. Dynamic programming RL algorithms are further classified as policy iteration, value iteration and Q-learning. Policy iteration alternates policy evaluation (computing the value function of the current policy, starting from a random policy) with policy improvement (updating the policy from that value function using the Bellman operator), and repeats until the policy converges. Value iteration finds the optimal value function and then derives the optimal policy from it. Q-learning estimates an action-value function that maps each state-action pair to a value.
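
As a rough illustration of value iteration as described above, the following Python sketch applies the Bellman optimality backup to a tiny MDP and then derives the greedy policy. The three-state MDP, the discount factor and all names (`P`, `GAMMA`, `q_value`, `value_iteration`) are assumptions made for this example only; they are not the industrial process or the implementation discussed in this abstract.

```python
# Hypothetical 3-state, 2-action MDP used only for illustration.
# P[s][a] is a list of (probability, next_state, reward) tuples.
P = {
    0: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 0.0)]},
    1: {0: [(1.0, 0, 0.0)], 1: [(1.0, 2, 1.0)]},
    2: {0: [(1.0, 2, 0.0)], 1: [(1.0, 2, 0.0)]},  # absorbing state
}
GAMMA = 0.9  # assumed discount factor


def q_value(V, s, a, gamma=GAMMA):
    """Expected return of taking action a in state s, then following V."""
    return sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])


def value_iteration(tol=1e-8, gamma=GAMMA):
    """Repeat Bellman optimality backups until the values stop changing."""
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            best = max(q_value(V, s, a, gamma) for a in P[s])
            delta = max(delta, abs(best - V[s]))
            V[s] = best  # in-place update
        if delta < tol:
            break
    # Derive the greedy policy from the converged value function.
    policy = {s: max(P[s], key=lambda a: q_value(V, s, a, gamma)) for s in P}
    return V, policy


V, policy = value_iteration()
```

On this toy problem the values converge to V[0] = 0.9 and V[1] = 1.0, and the greedy policy selects action 1 in both non-absorbing states.
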
Deep Q-Learning (published by David Silver, Google DeepMind, 2015), Double Q-Network (published by Hado van Hasselt, Google DeepMind, 2016), Dueling Q-Network (published by Ziyu Wang, Google DeepMind, 2015) and Prioritized Experience Replay (PER-Double Q-Network, published by Tom Schaul, Google DeepMind, 2015) are extensions of vanilla Q-Learning for handling large discrete state-action spaces. The Rainbow algorithm, published by Matteo Hessel, Google DeepMind, 2017, combines these improvements to vanilla Q-Learning with Multi-step Returns, Distributional RL and Noisy Nets, leading to state-of-the-art results when compared with a baseline of each individual component alone.
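
To make the difference between the vanilla and Double Q-Network targets concrete, here is a minimal Python sketch of the two bootstrap targets for a single transition. The function names and the list-based Q-value inputs are hypothetical; in practice the values would come from the online and target networks.

```python
def dqn_target(reward, next_q_target, gamma=0.99, done=False):
    """Vanilla target: the target network both selects and evaluates the action."""
    if done:
        return reward
    return reward + gamma * max(next_q_target)


def double_dqn_target(reward, next_q_online, next_q_target, gamma=0.99, done=False):
    """Double Q target: the online network selects the action,
    the target network evaluates it, which reduces overestimation."""
    if done:
        return reward
    best_action = max(range(len(next_q_online)), key=lambda a: next_q_online[a])
    return reward + gamma * next_q_target[best_action]
```

For example, with online values [1.0, 2.0], target values [0.5, 0.3], reward 1.0 and gamma 0.5, the vanilla target bootstraps from 0.5 while the Double Q target bootstraps from 0.3, the target network's value for the action the online network prefers.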

To handle continuous or stochastic action spaces, policy-based algorithms (REINFORCE with policy gradients) optimize the policy directly without using a value function. A hybrid method, Advantage Actor-Critic (A2C), consists of two distinct deep neural networks: a critic that measures the quality of the action taken (value-based) and an actor that controls how the agent behaves (policy-based). It stabilizes learning in comparison with the pure policy-gradient approach. An extension of A2C, the Asynchronous Advantage Actor-Critic (A3C) algorithm published by Volodymyr Mnih, Google DeepMind, 2016, executes a set of environments in parallel and performs the policy gradient updates using the advantage function. Several methods improve the stability, convergence and sample efficiency of the stochastic policy gradient method. Proximal Policy Optimization (PPO), published by John Schulman, OpenAI, 2017, applies a clipped surrogate objective to the policy update. Trust Region Policy Optimization (TRPO), published by John Schulman, UC Berkeley, 2015, enforces a Kullback–Leibler divergence constraint on the size of the policy update at each iteration. Actor-Critic using Kronecker-Factored Trust Region (ACKTR), published by Yuhuai Wu, University of Toronto, uses Kronecker-Factored Approximate Curvature (K-FAC) for the gradient update of both the critic and the actor. Soft Actor-Critic (SAC), an off-policy actor-critic method published by Tuomas Haarnoja, UC Berkeley, 2018, integrates the entropy of the policy into the reward to steer exploration.
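
The clipped surrogate objective used by PPO can be sketched in a few lines of Python. This is a simplified, framework-free version over plain lists; the function name `ppo_clip_objective` and the default clip range of 0.2 are assumptions for illustration, and a real implementation would compute the ratios and advantages from network outputs and differentiate through them.

```python
def ppo_clip_objective(ratios, advantages, eps=0.2):
    """Mean clipped surrogate objective (to be maximized).

    ratios:     new_policy_prob / old_policy_prob for each sampled action
    advantages: advantage estimate for each sampled action
    eps:        clip range, so ratios are limited to [1 - eps, 1 + eps]
    """
    terms = []
    for r, a in zip(ratios, advantages):
        clipped_r = min(max(r, 1.0 - eps), 1.0 + eps)
        # Taking the minimum makes the objective a pessimistic bound,
        # removing the incentive to move the ratio outside the clip range.
        terms.append(min(r * a, clipped_r * a))
    return sum(terms) / len(terms)
```

With a ratio of 1.5 and a positive advantage of 1.0 the term is capped at 1.2, whereas with the same ratio and an advantage of -1.0 the unclipped value -1.5 is kept, so harmful updates are still fully penalized.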

The algorithms described above model the policy as a probability distribution over actions given the known current state (stochastic). Deterministic Policy Gradients (DPG), published by David Silver, Google DeepMind, 2014, instead models the policy as a deterministic function. Deep Deterministic Policy Gradients (DDPG), published by Lillicrap, Google DeepMind, 2015, combines DPG with DQN and learns a stable Q-function through experience replay and a fixed target network. DDPG learns a deterministic policy and extends it to continuous action spaces within the actor-critic framework. In Distributed Distributional Deep Deterministic Policy Gradients (D4PG), a distributional critic treats the Q value as a random variable and estimates its distribution, multiple distributed actors gather experience in parallel, and Prioritized Experience Replay (PER) is used. D4PG is a model-free, off-policy actor-critic algorithm that learns policies in high-dimensional, continuous action spaces, published by Gabriel Barth-Maron, Google DeepMind, 2018.
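
One common way to keep the target network nearly fixed in DDPG-style methods is a slow, "soft" (Polyak-averaged) update of the target parameters toward the online parameters. The Python sketch below shows that update on flat parameter lists; the function name `soft_update` and the value of `tau` are assumptions for illustration and may differ from the configuration used in this work.

```python
def soft_update(target_params, online_params, tau=0.005):
    """Move each target parameter a small step toward its online counterpart.

    With a small tau the target network changes slowly, which keeps the
    bootstrapped Q-learning targets close to fixed between updates.
    """
    return [(1.0 - tau) * t + tau * o
            for t, o in zip(target_params, online_params)]
```

For example, `soft_update([0.0, 1.0], [1.0, 1.0], tau=0.1)` moves the first parameter from 0.0 to 0.1 and leaves the second, which already matches, unchanged.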

The Augmented Random Search (ARS) algorithm, published by Horia Mania, UC Berkeley, 2018, is a random search method for training linear policies for continuous control problems. It augments the basic random search method and is reported to be computationally faster than the competing baseline RL algorithms. Evolution Strategies (ES), a subset of black-box optimization methods, are applied as a competitive alternative for training function approximators such as deep neural networks for Reinforcement Learning. ES is a model-agnostic optimization approach that learns the optimal solution by imitating Darwin's theory of the evolution of species by natural selection. Covariance Matrix Adaptation Evolution Strategy (CMA-ES) and Genetic Algorithms are used to train the function approximators. Deep Recurrent Q-Learning for Partially Observable MDPs, published by Matthew Hausknecht, University of Texas at Austin, 2015, overcomes the limited memory of RL agents. Distributional Reinforcement Learning with Quantile Regression, published by Will Dabney, Google DeepMind, 2017, examines distinct ways of learning the value distribution rather than the traditional value function. GAN Q-learning, published by Thang Doan, McGill University, 2018, uses generative adversarial networks (GANs) as an alternative way of applying the distributional methodology to reinforcement learning for better learning of the function approximator.

Artificial Intelligence based cognitive autonomous agents are ready for real-time monitoring and predictive control. State-of-the-art results are obtained for a Multi-Input Multi-Output (MIMO) real-time industrial-scale problem. The algorithms implemented, their architectures and the results obtained will be discussed in comparison with the baseline of traditional model-based optimal control. The extensive computations were made possible largely by GPU-backed machines.
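
The random-search update at the core of ARS can be sketched as follows. This is a simplified Python illustration under stated assumptions: `toy_return` is a hypothetical stand-in for an episode rollout of the process, the hyperparameters are arbitrary, and the state normalization used in the full ARS variants is omitted.

```python
import random
import statistics


def toy_return(theta):
    """Hypothetical episode return: highest when theta equals the target."""
    target = [1.0, -2.0]
    return -sum((t - g) ** 2 for t, g in zip(theta, target))


def ars_step(theta, evaluate, rng, n_dirs=8, top_b=4, step=0.1, noise=0.1):
    """One random-search update: probe +/- perturbations, keep the best
    directions, and scale the step by the std of their returns."""
    candidates = []
    for _ in range(n_dirs):
        delta = [rng.gauss(0.0, 1.0) for _ in theta]
        r_plus = evaluate([t + noise * d for t, d in zip(theta, delta)])
        r_minus = evaluate([t - noise * d for t, d in zip(theta, delta)])
        candidates.append((r_plus, r_minus, delta))
    # Keep the top_b directions ranked by their better return.
    candidates.sort(key=lambda c: max(c[0], c[1]), reverse=True)
    best = candidates[:top_b]
    # Population std of the 2 * top_b returns; fall back to 1.0 if it is zero.
    sigma = statistics.pstdev([r for c in best for r in c[:2]]) or 1.0
    update = [0.0] * len(theta)
    for r_plus, r_minus, delta in best:
        for i, d in enumerate(delta):
            update[i] += (r_plus - r_minus) * d
    return [t + step / (top_b * sigma) * u for t, u in zip(theta, update)]


def train(iterations=200, seed=0):
    """Run repeated ARS steps from a zero initial parameter vector."""
    rng = random.Random(seed)
    theta = [0.0, 0.0]
    for _ in range(iterations):
        theta = ars_step(theta, toy_return, rng)
    return theta
```

Calling `train()` returns a parameter vector whose toy return should be higher than that of the starting point `[0.0, 0.0]`; in a control setting `evaluate` would instead run the linear policy in the environment and return the episode reward.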