(104h) Families of Data-Driven Surrogates Based on Accuracy and Complexity | AIChE


Ahmad, M. - Presenter, National University of Singapore
Karimi, I., National University of Singapore
With rapid advances in computational technologies and the swift progress of Industry 4.0, one can incorporate finer details into digital twins to model intricate processes better. However, this comes at the expense of the large computational burden associated with high-fidelity models. Data-driven surrogates, or meta-models, offer computationally cheaper alternatives to complex digital twins. They build an approximate response surface by learning correlations between process inputs and outputs. Any surrogate model comprises two components: a modeling technique and a surrogate form (Garud et al., 2018). While the former indicates the underlying algorithm used to build a model, the latter constitutes the analytical form and functional details of the model. Different combinations of modeling techniques and surrogate forms can produce a long list of unique surrogate models. The performance of any surrogate model depends on the system nonlinearities, the quality and quantity of the sampled data, and the flexibility of the surrogate model.

Previous works have studied the effect of certain data-specific features on the performance of surrogates. Davis et al. (2017) observed that a surrogate's predictive performance is influenced by the number of input dimensions, the sample size, the sampling technique, and the shape of the data-generating analytical function. They analyzed the impact of these characteristics on the training times and predictive performances of eight surrogates. Recently, Williams and Cremaschi (2021) extended this idea to provide rules of thumb that aid surrogate selection for modeling and optimization tasks. They drew their conclusions by evaluating the predictive performance of eight surrogate models, based on normalized root mean squared error and adjusted R², over various data sets with different features. Bhosekar and Ierapetritou (2018) analyzed the effect of sample size and sampling technique on the performance of nine different variants of the kriging modeling technique. They also highlighted the similar performance of some kriging variants. This observation was also made in our previous work (Garud et al., 2018; Ahmad and Karimi, 2021) on the meta-learning-based surrogate selection paradigm LEAPS2. Certain surrogates showed close performance over most noisy or non-noisy data sets. While such observations are mentioned only briefly in the literature, it would be interesting to identify and extensively report similarly performing surrogates across various modeling techniques.

Therefore, in this work, we aim to identify sets, or families, of similar surrogates from a pool of 50 surrogates. We evaluated surrogate performance over a diverse collection of data sets using two performance metrics. The coefficient of determination (R²) measures the predictive accuracy of a surrogate, while the Surrogate Quality Score (SQS) (Ahmad and Karimi, 2021) accounts for model complexity in addition to accuracy. We used the correlation coefficient to quantify the extent of agreement, or similarity, between the performances of any two surrogate models. This enabled us to identify pairs of similar surrogates and hence build families of mutually similar surrogates.

Our results revealed separate and very different families for non-noisy and noisy data sets, based on either performance metric. For non-noisy data sets, we obtained nine families based on each of the R² and SQS metrics. Although the families were almost alike for the two metrics, they were not identical. Certain complex surrogates, especially those belonging to the support vector regression technique, are penalized heavily by SQS and hence belonged to different families under R² and SQS. While most families comprised surrogates with the same modeling technique, two families contained many surrogates from different modeling techniques, for both performance metrics.

For noisy data, surrogates belonging to the kriging and radial basis function techniques do not belong to any family because they overfit. Naturally, these techniques are unsuitable for modeling noisy data. Hence, we obtained fewer families than for non-noisy data. Furthermore, the families based on R² and SQS differed markedly: seven families were identified based on R², but only three based on SQS. While some R²-based families comprised surrogates using different techniques, each SQS-based family consisted of surrogates with the same modeling technique.
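
The family-building step described above can be illustrated with a short sketch. It is not the procedure used in this work: the similarity threshold (0.9), the greedy grouping rule, the surrogate names, and the R² values are all assumptions made purely for illustration. Each surrogate is represented by its scores over the same data sets, pairs are compared with the Pearson correlation coefficient, and a surrogate joins a family only if it is similar to every member already in it.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if norm_x == 0.0 or norm_y == 0.0:
        return 0.0  # constant scores: treat as "no measurable agreement"
    return cov / (norm_x * norm_y)


def build_families(scores, threshold=0.9):
    """Group surrogates whose scores are mutually correlated above `threshold`.

    scores: dict mapping surrogate name -> list of scores, one per data set.
    threshold: assumed similarity cut-off (hypothetical, not from the abstract).
    Surrogates that are not similar to any other are left out of all families.
    """
    names = list(scores)
    similar = {
        frozenset(pair)
        for pair in combinations(names, 2)
        if pearson(scores[pair[0]], scores[pair[1]]) >= threshold
    }
    families = []
    for name in names:
        for family in families:
            # Join only if similar to every current member (mutual similarity).
            if all(frozenset((name, member)) in similar for member in family):
                family.append(name)
                break
        else:
            families.append([name])
    return [family for family in families if len(family) > 1]


# Made-up R² values over five data sets; names and numbers are hypothetical.
scores = {
    "poly_2": [0.90, 0.60, 0.80, 0.40, 0.70],
    "poly_3": [0.92, 0.63, 0.81, 0.45, 0.72],
    "svr_rbf": [0.50, 0.90, 0.40, 0.85, 0.60],
    "svr_lin": [0.52, 0.88, 0.43, 0.80, 0.61],
    "kriging": [0.70, 0.70, 0.95, 0.60, 0.30],
}
print(build_families(scores))
# [['poly_2', 'poly_3'], ['svr_rbf', 'svr_lin']]
```

In this toy example the hypothetical `kriging` entry correlates weakly with every other surrogate and therefore belongs to no family, mirroring the kind of outcome reported above for noisy data. A greedy pass like this depends on the order of the surrogates; an exhaustive clique search would remove that dependence at higher cost.
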
Our families for noisy and non-noisy data sets were validated by checking that the surrogates within each family also perform similarly on several new data sets that were not used to derive the original families. The proposed classification of surrogates into families opens up a computationally efficient route to surrogate selection that avoids an exhaustive search across all surrogates.
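
One way such families could reduce the search effort is a two-stage screen, sketched below. The abstract does not specify a selection procedure, so this is only one plausible reading: the function name `select_surrogate`, the choice of the first member as a family's representative, and the `evaluate` callable (a stand-in for fitting and validating a surrogate) are all hypothetical.

```python
def select_surrogate(families, evaluate):
    """Two-stage search: screen one representative per family, then refine.

    families: list of lists of surrogate names (each list is one family).
    evaluate: callable mapping a surrogate name to a score, higher is better;
              a hypothetical stand-in for training and validating a surrogate.
    Returns (best_name, best_score, number_of_evaluations).
    """
    cache = {}

    def score(name):
        # Evaluate each surrogate at most once.
        if name not in cache:
            cache[name] = evaluate(name)
        return cache[name]

    # Stage 1: evaluate only the first member of each family.
    best_family = max(families, key=lambda family: score(family[0]))
    # Stage 2: evaluate the remaining members of the most promising family.
    best_name = max(best_family, key=score)
    return best_name, cache[best_name], len(cache)


# Made-up families and scores, for illustration only.
example_families = [["a1", "a2", "a3"], ["b1", "b2"], ["c1", "c2", "c3"]]
example_scores = {
    "a1": 0.70, "a2": 0.72, "a3": 0.69,
    "b1": 0.90, "b2": 0.93,
    "c1": 0.50, "c2": 0.52, "c3": 0.49,
}
best, best_score, n_evals = select_surrogate(
    example_families, example_scores.__getitem__
)
print(best, best_score, n_evals)
# b2 0.93 4
```

Here four of the eight hypothetical surrogates are evaluated instead of all eight. The saving relies on members of a family performing alike, which is the property the families are built to capture.
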


Bhosekar, A., Ierapetritou, M., 2018. Advances in surrogate based modeling, feasibility analysis, and optimization: A review. Computers & Chemical Engineering 108, 250–267. https://doi.org/10.1016/j.compchemeng.2017.09.017

Davis, S.E., Cremaschi, S., Eden, M.R., 2017. Efficient Surrogate Model Development: Optimum Model Form Based on Input Function Characteristics, in: Computer Aided Chemical Engineering. Elsevier, pp. 457–462. https://doi.org/10.1016/B978-0-444-63965-3.50078-7

Garud, S.S., Karimi, I.A., Kraft, M., 2018. LEAPS2: Learning based Evolutionary Assistive Paradigm for Surrogate Selection. Computers & Chemical Engineering 119, 352–370. https://doi.org/10.1016/j.compchemeng.2018.09.008

Williams, B., Cremaschi, S., 2021. Selection of Surrogate Modeling Techniques for Surface Approximation and Surrogate-Based Optimization. Chemical Engineering Research and Design. https://doi.org/10.1016/j.cherd.2021.03.028