(402b) Updated LEAPS2 for Surrogate Recommendation | AIChE

Authors 

Ahmad, M. - Presenter, National University of Singapore
Karimi, I. A., National University of Singapore
The growing need to model complex physical systems accurately is increasing the complexity of high-fidelity models. Simpler, often analytical, computationally inexpensive surrogates or meta-models offer an attractive alternative. Surrogates are data-driven models that mimic the input–output patterns in data by means of response surfaces. Selecting the surrogate that approximates a complex system most accurately is therefore critical. A straightforward approach to this selection problem is to “try” several surrogates and pick the best. Several other approaches exist in the literature. Genetic Programming (GP) has been used (Koza 1994; Streeter and Becker 2003; Lessmann, Stahlbock, and Crone 2006) to derive an optimal combination of operators and simple basis functions defining a surrogate model. Cozad, Sahinidis, and Miller (2014) developed ALAMO, which uses MILP optimization to identify the best mix of basis functions for a low-complexity, accurate model. MINLP formulations (Cozad and Sahinidis 2018) and extended GP (Kaizen Programming) (Rad, Feng, and Iba 2018) have also been used to good effect. All these approaches must be applied exhaustively to every new data set to determine the best surrogate.

A smarter, faster, learning-based alternative is to first unearth patterns that match the meta-features or attributes of a data set with surrogate model performance. Such meta-learning can then help select the surrogate for a future data set. Cui et al. (2016) and Garud, Karimi, and Kraft (2018) developed this basic idea into the CRS and LEAPS2 frameworks, respectively. Later, Davis, Cremaschi, and Eden (2018) studied the performance of surrogates with respect to sample sizes, input dimensions, and shapes of input functions. Although LEAPS2 addressed the limitations of CRS with respect to sample sizes, dimensionality, and surrogates, it has its own shortcomings.
It was trained only on noise-free synthetic data, so it may recommend an over-fitting model for real-world, noisy data. Moreover, LEAPS2 used an error-based metric that requires splitting the data into train/test sets. Furthermore, LEAPS2 mixed data-distribution-based attributes, such as local and global fluctuations and dimensionality, with statistical attributes of the data itself, such as mean, standard deviation, and gradient. Intuitively, only the distribution-based attributes should matter, as it is the underlying trends or features in the data that determine surrogate performance.

In this work, we modified and broadened the scope of LEAPS2 in several significant ways. First, we incorporated noisy and real-world data sets to address a key challenge in surrogate modeling. Second, we added a complexity-based metric for surrogate selection, namely the AIC weight. This metric provides an alternative when splitting the data set into train/test sets is not feasible. Third, we revamped the attribute set of LEAPS2 to use only attributes that quantify the underlying features of the data distribution, rather than the data itself. Some of these new attributes quantify the degree and variation of non-linearity in the data, and the asymmetry and flatness of the response relative to a standard distribution. Thus, LEAPS2 now has fewer (11 vs 14) but intuitively more appealing attributes. Fourth, we improved the surrogate recommendation strategy by developing simple heuristics. Finally, we updated the surrogate pool by adding 10 new surrogates to LEAPS2. Our improved framework was evaluated with respect to the two metrics of Garud et al. (2018): the “Total Degree of Success” (TDoS), which quantifies the success in recommending the best surrogates, and the “Total Coefficient of Reward” (TCoR), which combines success and computational savings in a single score. The new framework gives TDoS = 91% and TCoR = 42% for the error-based metric, and TDoS = 83% but a much higher TCoR = 63% for the AIC weight on test data; both metrics improved during the learning process. We also tested the new framework on two case studies with real data, one on a compressor and the other on COVID-19 data. In both cases, our improved LEAPS2 achieved a TDoS of 100%. This framework acts as a smart tool for surrogate selection in modeling complex physical systems.
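The AIC-weight metric trades off goodness of fit against model complexity without needing a held-out split. A minimal sketch, assuming Gaussian residuals so that AIC = n ln(RSS/n) + 2k (one common convention; the abstract does not state the exact form used inside LEAPS2, and the candidate models below are hypothetical):

```python
import numpy as np

def aic(rss, n, k):
    # AIC under Gaussian residuals: n*ln(RSS/n) + 2k, where k is the
    # number of fitted parameters (an assumed convention, see lead-in)
    return n * np.log(rss / n) + 2 * k

def aic_weights(aics):
    # Akaike weights: the relative likelihood of each candidate,
    # normalized so the weights sum to 1
    d = np.asarray(aics, dtype=float) - np.min(aics)
    w = np.exp(-0.5 * d)
    return w / w.sum()

# Hypothetical candidate surrogates: (residual sum of squares, #parameters)
n = 50  # sample size
candidates = {"linear": (4.0, 2), "cubic": (0.8, 4), "quintic": (0.75, 6)}
aics = {name: aic(rss, n, k) for name, (rss, k) in candidates.items()}
weights = aic_weights(list(aics.values()))
recommended = list(aics)[int(np.argmax(weights))]
```

Here the quintic fits marginally better than the cubic, but its two extra parameters outweigh the gain, so the weights favor the cubic: the complexity penalty is what guards against recommending an over-fitting surrogate on noisy data.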

References:

Cozad, Alison, and Nikolaos V. Sahinidis. 2018. “A Global MINLP Approach to Symbolic Regression.” Mathematical Programming 170 (1): 97–119. https://doi.org/10.1007/s10107-018-1289-x.

Cozad, Alison, Nikolaos V. Sahinidis, and David C. Miller. 2014. “Learning Surrogate Models for Simulation-Based Optimization.” AIChE Journal 60 (6): 2211–27. https://doi.org/10.1002/aic.14418.

Cui, Can, Mengqi Hu, Jeffery D. Weir, and Teresa Wu. 2016. “A Recommendation System for Meta-Modeling: A Meta-Learning Based Approach.” Expert Systems with Applications 46 (March): 33–44. https://doi.org/10.1016/j.eswa.2015.10.021.

Davis, Sarah E., Selen Cremaschi, and Mario R. Eden. 2018. “Efficient Surrogate Model Development: Impact of Sample Size and Underlying Model Dimensions.” In Computer Aided Chemical Engineering, 44:979–84. Elsevier. https://doi.org/10.1016/B978-0-444-64241-7.50158-0.

Garud, Sushant S., Iftekhar A. Karimi, and Markus Kraft. 2018. “LEAPS2: Learning Based Evolutionary Assistive Paradigm for Surrogate Selection.” Computers & Chemical Engineering 119 (November): 352–70. https://doi.org/10.1016/j.compchemeng.2018.09.008.

Koza, John R. 1994. “Genetic Programming as a Means for Programming Computers by Natural Selection.” Statistics and Computing 4 (2). https://doi.org/10.1007/BF00175355.

Lessmann, Stefan, Robert Stahlbock, and Sven F Crone. 2006. “Genetic Algorithms for Support Vector Machine Model Selection,” 7.

Rad, Hossein Izadi, Ji Feng, and Hitoshi Iba. 2018. “GP-RVM: Genetic Programing-Based Symbolic Regression Using Relevance Vector Machine.” ArXiv:1806.02502 [Cs], August. http://arxiv.org/abs/1806.02502.

Streeter, Matthew, and Lee A Becker. 2003. “Automated Discovery of Numerical Approximation Formulae via Genetic Programming,” 32.