
(235f) Data-Efficient Globally Optimal Policy-Based Reinforcement Learning Via Gradient-Accelerated Bayesian Optimization

Authors 

Paulson, J. - Presenter, The Ohio State University
Makrygiorgos, G., University of California, Berkeley
Mesbah, A., University of California, Berkeley
Reinforcement learning (RL) has received tremendous attention in recent years due to its ability to directly learn solutions to stochastic optimal control problems from data [1]. The main difference between RL and traditional supervised or unsupervised machine learning is that we must learn how to map a currently measured situation/representation (i.e., the system state) to an action so as to maximize a numerical reward signal [2]. RL is thus an inherently closed-loop problem: the learner's actions affect the situations, and hence the actions, encountered later, and the learner is not told which actions/inputs to take but must discover them through repeated interaction with the environment. RL methods can be broadly categorized as either model-based or model-free [3]. Since model-free RL methods directly search for a control policy (i.e., a function that maps states to actions), they avoid the need to construct a general-purpose model of the system dynamics, which can be difficult to estimate in certain situations, especially when little to no prior knowledge of the model structure is available [4, Chapter 3].

The two main categories of model-free RL are (i) approximate dynamic programming (ADP) and (ii) policy search methods [3]. The underlying idea in ADP is to use past data to approximate the value function (or action-value function) via Bellman's principle of optimality. Although ADP has worked well for certain applications, it has important limitations, including the failure of well-known algorithms such as Q-learning [5] and SARSA [6] to converge on certain relatively simple problems. Instead of approximating a value function and using it to construct a policy, policy search methods directly learn a control policy from closed-loop performance data. Policy gradient methods are a particularly popular class of policy search RL algorithms that adjust the policy parameters θ in the direction of the reward gradient ∇R(θ), which can be estimated from closed-loop simulations (without a dynamic model of the system) using the policy gradient theorem [7]; a sketch of such an estimator is given below. Although more rigorous convergence guarantees can be obtained for policy gradient methods, these guarantees only assure convergence to a local optimum, since the methods use only the gradient of the reward function and not the reward value itself. More recent work has suggested replacing the (stochastic) gradient ascent underpinning policy gradient-based RL with global derivative-free optimization (DFO) algorithms. When the parameter space is relatively small, we can replace policy gradient with Bayesian optimization (BO) [8], which sequentially updates θ using a probabilistic surrogate model relating the parameters to the closed-loop reward/performance. However, the performance of BO is fundamentally limited by the black-box assumption on the constructed surrogate model, since it does not take advantage of any additional structural information.
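To make the policy gradient (score-function) estimator concrete, the following minimal sketch (our own illustrative code, not from the talk) estimates both the closed-loop reward R(θ) and its gradient ∇R(θ) from a batch of rollouts, assuming a hypothetical scalar linear system and a Gaussian state-feedback policy with a single parameter θ; all system and noise values are placeholders.

import numpy as np

def rollout_reward_and_grad(theta, n_episodes=200, horizon=20, sigma=0.1,
                            a=0.9, b=0.5, rng=None):
    # Monte Carlo estimates of R(theta) and dR/dtheta from closed-loop episodes
    # via the score-function (policy gradient) identity.
    rng = np.random.default_rng(0) if rng is None else rng
    rewards, grads = [], []
    for _ in range(n_episodes):
        x = rng.normal(1.0, 0.1)                  # random initial state (placeholder)
        ep_reward, score_sum = 0.0, 0.0
        for _ in range(horizon):
            mean_u = theta * x                    # linear state-feedback policy
            u = mean_u + sigma * rng.normal()     # Gaussian exploration noise
            # d/dtheta log pi(u | x; theta) for the Gaussian policy
            score_sum += (u - mean_u) * x / sigma**2
            ep_reward += -(x**2 + 0.1 * u**2)     # quadratic stage reward
            x = a * x + b * u + 0.01 * rng.normal()   # hypothetical scalar dynamics
        rewards.append(ep_reward)
        grads.append(score_sum * ep_reward)       # REINFORCE-style gradient sample
    return np.mean(rewards), np.mean(grads)

R_hat, dR_hat = rollout_reward_and_grad(theta=-0.5)

Both estimates come from the same batch of closed-loop data, which is exactly the pair of (noisy) observations that the surrogate model described next can consume.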

In this talk, we present a strategy that combines the advantages of policy gradient and BO search for RL applications, allowing us to more quickly identify globally optimal control policy parameters and thereby greatly reduce the amount of closed-loop performance/reward data required relative to state-of-the-art alternatives. The main idea is that, for any set of tested policy parameter values, we can apply the policy gradient theorem to obtain stochastic estimates of both the reward function and its gradient. We then use both pieces of information simultaneously to build a more accurate surrogate model, which accelerates the selection of the next parameter value to try in the simulator or experimental system. As is common in BO, we construct a Gaussian process (GP) surrogate model, which is non-parametric and probabilistic. An important advantage of GPs in our setting is that the derivative of a GP is itself a GP, so derivative information can be easily incorporated into both the prediction phase and the hyperparameter training procedure [9]; a sketch of this construction follows this paragraph. Although any expected utility (or acquisition) function can then be used, we focus on the knowledge gradient (KG) [10], which was specifically developed to handle observations subject to a significant amount of noise. KG searches for the policy parameter value that is most likely to improve the maximum of the posterior-mean predicted reward function; since the quality of the posterior mean can be substantially improved by incorporating derivative information, this directly translates into improved search capabilities. We will demonstrate the improved performance of the proposed gradient-accelerated BO method, including direct quantification of its faster convergence, compared to currently available alternatives on benchmark problems from the RL literature. To the best of our knowledge, this is one of the first contributions that enables RL to be practically applied to expensive simulations or experiments, potentially reducing the total required budget to on the order of 10-100 runs.
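To illustrate how derivative observations enter the GP surrogate, the following minimal sketch (an illustrative one-dimensional example with a squared-exponential kernel and fixed, assumed hyperparameters; not the implementation used in this work) conditions the GP jointly on noisy reward values and noisy gradient estimates, using the closed-form covariances between a GP and its derivative.

import numpy as np

def k(x1, x2, sig=1.0, ell=0.5):
    # squared-exponential kernel, cov(f(x1), f(x2))
    d = x1[:, None] - x2[None, :]
    return sig**2 * np.exp(-0.5 * d**2 / ell**2)

def k_fg(x1, x2, sig=1.0, ell=0.5):
    # cross-covariance cov(f(x1), f'(x2)) = dk/dx2
    d = x1[:, None] - x2[None, :]
    return k(x1, x2, sig, ell) * d / ell**2

def k_gg(x1, x2, sig=1.0, ell=0.5):
    # covariance between derivative observations, cov(f'(x1), f'(x2))
    d = x1[:, None] - x2[None, :]
    return k(x1, x2, sig, ell) * (1.0 / ell**2 - d**2 / ell**4)

def posterior_mean_with_grads(X, y, dy, Xs, noise_f=0.1, noise_g=0.3):
    # GP posterior mean of the reward at test points Xs, conditioned jointly on
    # noisy reward values y and noisy gradient estimates dy at training points X.
    Kff = k(X, X) + noise_f**2 * np.eye(len(X))
    Kfg = k_fg(X, X)
    Kgg = k_gg(X, X) + noise_g**2 * np.eye(len(X))
    K = np.block([[Kff, Kfg], [Kfg.T, Kgg]])
    k_star = np.hstack([k(Xs, X), k_fg(Xs, X)])
    alpha = np.linalg.solve(K, np.concatenate([y, dy]))
    return k_star @ alpha

# Hypothetical data: rewards and gradients at three tested policy parameter values.
theta_tr = np.array([-1.0, -0.5, 0.2])
R_tr = np.array([-3.1, -1.2, -2.4])
dR_tr = np.array([2.0, 0.5, -1.8])
theta_grid = np.linspace(-1.5, 0.5, 200)
mu = posterior_mean_with_grads(theta_tr, R_tr, dR_tr, theta_grid)
theta_next = theta_grid[np.argmax(mu)]   # crude stand-in for a knowledge-gradient search

The last line simply maximizes the posterior mean; the knowledge gradient acquisition used in the talk instead quantifies the expected improvement in that maximum after one more (noisy) evaluation, but it operates on the same gradient-informed posterior built above.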

References

[1] Sutton, R. S., & Barto, A. G. (2018). Reinforcement learning: An introduction. MIT Press.

[2] Mnih, V. et al. (2015). Human-level control through deep reinforcement learning. Nature, 518(7540), 529-533.

[3] Recht, B. (2019). A tour of reinforcement learning: The view from continuous control. Annual Review of Control, Robotics, and Autonomous Systems, 2, 253-279.

[4] Bertsekas, D. (2012). Dynamic programming and optimal control: Volume I (Vol. 1). Athena Scientific.

[5] Tsitsiklis, J. N. (1994). Asynchronous stochastic approximation and Q-learning. Machine Learning, 16(3), 185-202.

[6] Zhao, D., Wang, H., Shao, K., & Zhu, Y. (2016, December). Deep reinforcement learning with experience replay based on SARSA. In 2016 IEEE Symposium Series on Computational Intelligence (SSCI) (pp. 1-6). IEEE.

[7] Sutton, R. S., McAllester, D., Singh, S., & Mansour, Y. (1999). Policy gradient methods for reinforcement learning with function approximation. Advances in Neural Information Processing Systems, 12.

[8] Shahriari, B., Swersky, K., Wang, Z., Adams, R. P., & De Freitas, N. (2015). Taking the human out of the loop: A review of Bayesian optimization. Proceedings of the IEEE, 104(1), 148-175.

[9] Wu, J., Poloczek, M., Wilson, A. G., & Frazier, P. (2017). Bayesian optimization with gradients. Advances in Neural Information Processing Systems, 30.

[10] Wu, J. (2017). Knowledge Gradient Methods for Bayesian Optimization (Doctoral dissertation, Cornell University).