(203c) Characterizing Uncertainty and Error in Machine Learning Chemical Property Prediction
AIChE Annual Meeting
Monday, November 8, 2021 - 4:00pm to 4:15pm
We systematically study how model bias (errors due to model architecture and input representation), model variance, and target-data noise affect the performance of graph-convolutional neural networks on chemical property prediction tasks. By designing molecular prediction tasks whose exact solution is known and achievable by a graph-convolutional neural network, we can inject errors into the data and models in a controlled manner and study their effects on model performance. We combine these controlled errors with different uncertainty estimation techniques, changes to model architecture, and changes in the size or makeup of the dataset to demonstrate trends important to users of machine learning for property prediction. We show that, under random noise in the training and test sets, the true performance of a model can continue to improve with larger datasets while the apparent performance approaches an asymptote and ceases to improve. Further, we demonstrate the utility of heteroscedastic and homoscedastic loss functions for detecting noise errors in the dataset, both when those errors are associated with model features and when they are not. We apply measured ensemble variance as an estimate of epistemic error and use statistical analysis to project how much of the model error is due to variance observable through ensembling and how much is a baseline bias. Using trends over batch size and the observed interactions between different uncertainty characterizations, we provide methods for estimating the contribution of each error type to model performance, the likely effect of adding more data, and the maximum benefit available from ensembling.
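Two of the ideas above can be illustrated on a toy task: (1) with noisy labels, the apparent test error floors at the label-noise variance even when the true error (against the exact solution) keeps shrinking, and (2) ensemble variance serves as an estimate of epistemic error. The sketch below is a hypothetical stand-in for the paper's graph-convolutional setup — the linear ground truth `f`, the noise level `sigma`, and the bootstrap ensemble of closed-form linear fits are all illustrative assumptions, not the authors' actual models:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical task with a known, exactly achievable solution: f(x) = 2x + 1.
def f(x):
    return 2.0 * x + 1.0

n_train, n_test, sigma = 200, 1000, 0.5          # sigma: injected label-noise std
x_train = rng.uniform(-1, 1, n_train)
x_test = rng.uniform(-1, 1, n_test)
y_train = f(x_train) + rng.normal(0, sigma, n_train)       # noisy training targets
y_test_noisy = f(x_test) + rng.normal(0, sigma, n_test)    # noisy test targets
y_test_true = f(x_test)                                     # exact solution

def fit_linear(x, y):
    """Closed-form least squares for y ~ a*x + b."""
    A = np.stack([x, np.ones_like(x)], axis=1)
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

# Bootstrap ensemble: the spread of member predictions estimates epistemic error.
n_members = 20
preds = []
for _ in range(n_members):
    idx = rng.integers(0, n_train, n_train)                 # resample with replacement
    a, b = fit_linear(x_train[idx], y_train[idx])
    preds.append(a * x_test + b)
preds = np.array(preds)
mean_pred = preds.mean(axis=0)
epistemic_var = preds.var(axis=0).mean()                    # measured ensemble variance

apparent_mse = np.mean((mean_pred - y_test_noisy) ** 2)     # vs noisy labels
true_mse = np.mean((mean_pred - y_test_true) ** 2)          # vs exact solution

# Apparent error is inflated by roughly sigma**2 — the aleatoric floor that more
# data cannot remove — while the true error continues to shrink with dataset size.
print(f"apparent MSE {apparent_mse:.3f} ~ true MSE {true_mse:.3f} "
      f"+ noise var {sigma**2:.3f}; epistemic var {epistemic_var:.4f}")
```

On this toy problem the apparent MSE sits near the noise variance while the true MSE is tiny, which is the asymptote behavior the abstract describes; the small ensemble variance is the portion of error observable through ensembling.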