(633d) Predictive Life Cycle Assessment with Limited Training Data: Artificial Neural Networks Vs. Gaussian Process Regression

Authors: 
Kleinekorte, J. - Presenter, RWTH Aachen University
Beckert, V., RWTH Aachen University
Fleitmann, L., RWTH Aachen University
Kröger, L., RWTH Aachen University
Leonhard, K., RWTH Aachen University
Bardow, A., RWTH Aachen University
The growing awareness of climate change renders environmentally benign processes a top priority for the chemical industry. The development of environmentally benign processes would benefit from assessments of potential environmental impacts at early stages of development since far-reaching changes could still be implemented with little effort. However, data is very limited in early stages of process development. As a result, classical methods for environmental assessments such as Life Cycle Assessment (LCA) are not applicable. To enable early environmental assessments, predictive LCA models have been developed.

One promising class of predictive LCA models comprises molecular-structure models, which use molecular descriptors as input to machine learning algorithms. Once trained, these models circumvent the resource-intensive data collection required for LCA. Their main drawback is their dependence on the data used for training. For LCA of chemicals, this data basis is very limited: in recent publications on predictive LCA, artificial neural networks (ANN) were trained on sets of only 63 to 392 data points [1]. Reliable prediction by ANN, however, usually requires large training data sets [2].

In this work, we apply Gaussian Process Regression (GPR) to predictive LCA modeling based on molecular structure. GPR has been proposed as a promising alternative to ANN for small amounts of training data [3]. The performance of the GPR is assessed in terms of prediction accuracy and generalization ability and benchmarked against ANN models that have been proposed for predictive LCA in the literature. Both ANN and GPR are applied to predict the ReCiPe v1.08 (H) midpoint category Climate Change including biogenic carbon. Both models are automatically designed and trained using the framework proposed in our previous work [4]. The models are trained on 3 data sets of organic chemicals obtained from 3 commercial databases, ranging in size from 62 to 547 data points per data set.
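For readers unfamiliar with GPR, the following is a minimal sketch of the technique in plain Python. It is not the authors' implementation: the squared-exponential (RBF) kernel, the fixed hyperparameters `length_scale`, `variance`, and `noise`, and all function names are assumptions chosen for illustration. Each input row stands for a vector of molecular descriptors and each target for an impact value.

```python
import math

def rbf_kernel(a, b, length_scale=1.0, variance=1.0):
    """Squared-exponential kernel (an assumed choice, not from the abstract)."""
    sq_dist = sum((x - y) ** 2 for x, y in zip(a, b))
    return variance * math.exp(-0.5 * sq_dist / length_scale ** 2)

def cholesky(matrix):
    """Lower-triangular Cholesky factor of a symmetric positive definite matrix."""
    n = len(matrix)
    lower = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(lower[i][k] * lower[j][k] for k in range(j))
            if i == j:
                lower[i][j] = math.sqrt(matrix[i][i] - s)
            else:
                lower[i][j] = (matrix[i][j] - s) / lower[j][j]
    return lower

def solve_cholesky(lower, rhs):
    """Solve (L L^T) x = rhs by forward and backward substitution."""
    n = len(rhs)
    z = [0.0] * n
    for i in range(n):
        z[i] = (rhs[i] - sum(lower[i][k] * z[k] for k in range(i))) / lower[i][i]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (z[i] - sum(lower[k][i] * x[k] for k in range(i + 1, n))) / lower[i][i]
    return x

def gpr_fit(X, y, noise=1e-6, **kernel_args):
    """Precompute the quantities needed for prediction (constant-mean GP)."""
    mean = sum(y) / len(y)
    K = [[rbf_kernel(a, b, **kernel_args) + (noise if i == j else 0.0)
          for j, b in enumerate(X)] for i, a in enumerate(X)]
    lower = cholesky(K)
    alpha = solve_cholesky(lower, [v - mean for v in y])
    return {"X": X, "L": lower, "alpha": alpha, "mean": mean, "kernel_args": kernel_args}

def gpr_predict(model, x_new):
    """Return the posterior mean and variance at a single new input."""
    args = model["kernel_args"]
    k_star = [rbf_kernel(x_new, x, **args) for x in model["X"]]
    mean = model["mean"] + sum(k * a for k, a in zip(k_star, model["alpha"]))
    v = solve_cholesky(model["L"], k_star)
    var = rbf_kernel(x_new, x_new, **args) - sum(k * vi for k, vi in zip(k_star, v))
    return mean, max(var, 0.0)  # clip tiny negative values caused by round-off
```

Unlike a plain ANN, this model returns a predictive variance alongside each prediction, and it has only a handful of hyperparameters to tune, which is one commonly cited reason for considering GPR when training data is scarce.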

Each data set is split into training, validation, and test sets using the Kullback-Leibler divergence [5] to ensure similar distributions in all sets. 80 % of each data set is used to train the models, 10 % serves as the validation set for hyperparameter optimization, and the remaining 10 % as the test set to quantify prediction accuracy. Accuracy is measured by the normalized root mean squared error (RMSE) between predictions and observations. The generalization ability of a model is represented by the gap in normalized RMSE between training and test set.
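The abstract does not state how the Kullback-Leibler divergence enters the splitting procedure. One plausible reading, sketched below under that assumption, is to draw many random 80/10/10 splits and keep the one whose subsets have target-value histograms closest to the full data set. The function names, the number of bins, the number of trials, and the additive smoothing are all hypothetical choices.

```python
import math
import random

def histogram(values, edges):
    """Smoothed relative frequencies of values over fixed bin edges."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        for i in range(len(counts)):
            # The last bin is closed on the right so the maximum is counted.
            if v < edges[i + 1] or i == len(counts) - 1:
                counts[i] += 1
                break
    # Additive smoothing avoids log(0) for empty bins (an assumed choice).
    total = len(values) + len(counts)
    return [(c + 1) / total for c in counts]

def kl_divergence(p, q):
    """Discrete Kullback-Leibler divergence D(p || q)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def kl_guided_split(targets, n_bins=5, n_trials=200, seed=0):
    """Pick the random 80/10/10 split whose subsets best match the full set."""
    rng = random.Random(seed)
    lo, hi = min(targets), max(targets)
    width = (hi - lo) / n_bins or 1.0  # fall back to 1.0 if all targets are equal
    edges = [lo + i * width for i in range(n_bins + 1)]
    reference = histogram(targets, edges)
    n = len(targets)
    n_val = max(1, round(0.1 * n))
    n_test = max(1, round(0.1 * n))
    indices = list(range(n))
    best_score, best_split = math.inf, None
    for _ in range(n_trials):
        rng.shuffle(indices)
        test = indices[:n_test]
        val = indices[n_test:n_test + n_val]
        train = indices[n_test + n_val:]
        # Sum of divergences of each subset from the full-set distribution.
        score = sum(
            kl_divergence(histogram([targets[i] for i in part], edges), reference)
            for part in (train, val, test)
        )
        if score < best_score:
            best_score, best_split = score, (list(train), list(val), list(test))
    return best_split, best_score
```

The function returns index lists for the training, validation, and test sets together with the summed divergence of the selected split; a lower value means the three subsets resemble the full data set more closely.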

The GPR can predict Climate Change impacts from the molecular structure with good prediction accuracy: the normalized RMSE is between 3 % and 7.6 %. In contrast, the ANN has a normalized RMSE about 3 times higher on all data sets. Thus, the GPR outperforms the ANN in terms of prediction accuracy on all data sets.

Additionally, the generalization ability of the GPR improves with increasing data set size, as indicated by gaps decreasing from 3.7 % to 1.1 %. For the ANN models, the gap between training and test set also decreases (from 20.1 % to 2.1 %), showing that the generalization ability of the ANN likewise increases strongly with data set size. Still, the GPR model also outperforms the ANN model in terms of generalization ability: even on the largest data set, the gap of the ANN is still twice as large as that of the GPR.
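The two metrics used above could be computed along the following lines. The abstract does not say how the RMSE is normalized; this sketch assumes normalization by the range of the observed values, which is one common convention. The function names are hypothetical.

```python
import math

def normalized_rmse(observed, predicted):
    """RMSE divided by the range of the observations (assumed normalization)."""
    if not observed or len(observed) != len(predicted):
        raise ValueError("observed and predicted must be non-empty and equally long")
    mse = sum((o - p) ** 2 for o, p in zip(observed, predicted)) / len(observed)
    value_range = max(observed) - min(observed)
    if value_range == 0:
        raise ValueError("observations must not all be identical")
    return math.sqrt(mse) / value_range

def generalization_gap(train_obs, train_pred, test_obs, test_pred):
    """Difference between test and training normalized RMSE."""
    return normalized_rmse(test_obs, test_pred) - normalized_rmse(train_obs, train_pred)
```

A small gap means the model performs almost as well on unseen chemicals as on those it was trained on, while a large gap points to overfitting.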

In conclusion, Gaussian Process Regression (GPR) can improve prediction performance and generalization ability of predictive molecular-structure LCA models compared to ANN. Thus, GPR is a promising alternative for predictive LCA where training data is limited today.

Acknowledgement

The authors thank the German Federal Ministry of Education and Research (BMBF) for funding within the project consortium “Carbon2Chem” under Contract 03EK3042C.