(659e) Thermochemistry of Gas-Phase and Surface Species Via Lasso-Assisted Subgraph Selection

Authors: 
Lym, J., University of Delaware
Gu, G. H., University of Delaware
Plechac, P., Department of Mathematical Sciences
Vlachos, D. G., University of Delaware
First-principles modelling of chemical kinetics provides key insights into reaction mechanisms and guides the design of better catalysts. However, such modelling becomes prohibitive for large reaction networks because of the computational cost of ab initio methods. This cost can be circumvented using graph theory-based semi-empirical methods, such as group additivity. These models assume that the thermodynamic properties of a molecule are a linear function of the number of times each graph (group) appears in it. However, the selection of these graphs and of the correction terms is heuristic, resulting in a suboptimal model.
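
As a minimal sketch of the linear assumption behind group additivity, the snippet below estimates a property as the sum, over groups, of count times contribution. The group labels, the contribution values, and the function name `estimate_property` are hypothetical placeholders chosen for illustration; they are not fitted values or code from this work.

```python
from collections import Counter

# Hypothetical group contributions (kcal/mol); illustrative placeholders only,
# not values from this work or from a published group additivity scheme.
GROUP_CONTRIBUTIONS = {
    "C-(C)(H)3": -10.0,
    "C-(C)2(H)2": -5.0,
}


def estimate_property(groups, contributions=GROUP_CONTRIBUTIONS):
    """Linear group additivity: sum over groups of (count * contribution)."""
    counts = Counter(groups)
    return sum(n * contributions[g] for g, n in counts.items())


# A propane-like molecule described as two terminal groups and one interior group.
propane_groups = ["C-(C)(H)3", "C-(C)2(H)2", "C-(C)(H)3"]
estimate = estimate_property(propane_groups)
```

In this picture, choosing which groups and correction terms enter the contribution table is the modelling decision that is traditionally made by hand.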

To improve the accuracy of these methods, we explore the Least Absolute Shrinkage and Selection Operator (LASSO), an automatic descriptor selection algorithm. The graphs in each molecule are exhaustively enumerated, and LASSO regression is performed to find the subset that best describes the enthalpy of formation. We have applied this framework to a variety of data sets. First, we gather heat of formation data from the NIST and BURCAT datasets and find that LASSO-assisted graph selection outperforms heuristic group additivity (mean absolute error of 5.7 vs. 9.2 kcal/mol). Second, we apply the framework to the QM9 dataset and find that our model performs comparably to state-of-the-art machine learning models, with a mean absolute error of 1.39 kcal/mol. Third, we apply the framework to lignin monomer adsorbates on Pt(111) and obtain a model with a mean absolute error of 2.08 kcal/mol, showing promising performance for surface species. We discuss how LASSO is especially useful for surface adsorbates, where manually identifying strain corrections is difficult.
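
The selection step can be illustrated with a small, self-contained sketch: a count matrix of candidate graphs is regressed against a target property with an L1 penalty, and graphs whose coefficients shrink to exactly zero are dropped. This is a toy under stated assumptions, not the authors' implementation: the solver is a plain cyclic coordinate descent written for this example, the candidate names `A`, `B`, `C`, the count matrix, the target values, and the penalty `alpha` are all made up, and no intercept or feature scaling is used.

```python
def soft_threshold(rho, alpha):
    """Shrink rho toward zero by alpha; values in [-alpha, alpha] become exactly 0."""
    if rho > alpha:
        return rho - alpha
    if rho < -alpha:
        return rho + alpha
    return 0.0


def lasso_coordinate_descent(X, y, alpha, n_sweeps=500):
    """Minimize (1/(2n)) * ||y - Xw||^2 + alpha * ||w||_1 by cyclic coordinate descent."""
    n, p = len(X), len(X[0])
    w = [0.0] * p
    for _ in range(n_sweeps):
        for j in range(p):
            rho = 0.0  # column j dotted with the residual that excludes column j
            z = 0.0    # squared norm of column j
            for i in range(n):
                pred_without_j = sum(X[i][k] * w[k] for k in range(p) if k != j)
                rho += X[i][j] * (y[i] - pred_without_j)
                z += X[i][j] ** 2
            if z == 0.0:
                w[j] = 0.0  # a graph that never occurs carries no information
                continue
            w[j] = soft_threshold(rho / n, alpha) / (z / n)
    return w


# Hypothetical toy data: rows are molecules, columns are counts of candidate graphs.
names = ["A", "B", "C"]
X = [
    [1, 0, 0],
    [0, 1, 0],
    [1, 1, 0],
    [2, 1, 1],
    [1, 2, 1],
    [0, 0, 1],
]
# Targets built from made-up contributions A = -10, B = -5, C = 0 (kcal/mol).
y = [-10.0, -5.0, -15.0, -25.0, -20.0, 0.0]

coefficients = lasso_coordinate_descent(X, y, alpha=0.01)
# Keep only graphs with a nonzero coefficient; on this toy data C is dropped.
selected = [name for name, w_j in zip(names, coefficients) if w_j != 0.0]
```

The penalty strength controls how many graphs survive: a larger `alpha` yields a sparser model, and a sufficiently large value removes every candidate. In practice this value would typically be tuned, for example by cross-validation, which the sketch does not attempt.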