(659e) Thermochemistry of Gas-Phase and Surface Species Via Lasso-Assisted Subgraph Selection
To improve the accuracy of group additivity methods, we explore the Least Absolute Shrinkage and Selection Operator (LASSO), a regression technique that selects descriptors automatically. Subgraphs of each molecule are exhaustively enumerated, and LASSO regression is performed to find the subset that best describes the enthalpy of formation. We have applied this framework to three data sets. First, we gathered heat of formation data from the NIST and BURCAT datasets and found that LASSO-assisted subgraph selection outperforms heuristic-based group additivity (5.7 vs. 9.2 kcal mol⁻¹ mean absolute error). Second, we applied the framework to the QM9 dataset and found that our model performs comparably to state-of-the-art machine learning models, with a mean absolute error of 1.39 kcal mol⁻¹. Third, we applied the framework to lignin monomer adsorbates on Pt(111) and obtained a model with a mean absolute error of 2.08 kcal mol⁻¹, showing promising performance for surface species. We discuss how LASSO is especially useful for surface adsorbates, where strain corrections are difficult to identify manually.
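As a rough illustration of the idea, the sketch below fits a LASSO model to subgraph counts using plain coordinate descent with soft-thresholding. It is not the authors' implementation: the descriptor names, the count matrix, the target values, and the penalty `alpha` are all synthetic assumptions chosen for illustration, and the subgraph enumeration step is assumed to have been done already.

```python
# Minimal LASSO sketch (coordinate descent), assuming subgraph counts are
# already enumerated. All names and numbers below are synthetic placeholders,
# not data from NIST, BURCAT, QM9, or the Pt(111) adsorbate set.

def soft_threshold(z, t):
    """Shrink z toward zero by t; values within [-t, t] become exactly 0."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0

def lasso_objective(X, y, w, alpha):
    """(1 / 2n) * sum of squared residuals + alpha * L1 norm of w."""
    n = len(X)
    sq = sum((y[i] - sum(X[i][k] * w[k] for k in range(len(w)))) ** 2
             for i in range(n))
    return sq / (2 * n) + alpha * sum(abs(v) for v in w)

def lasso_coordinate_descent(X, y, alpha, n_iter=500):
    """Minimize lasso_objective by cycling through one weight at a time."""
    n, p = len(X), len(X[0])
    w = [0.0] * p
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue  # descriptor never occurs; leave its weight at 0
            # Correlation of descriptor j with the residual that ignores j
            rho = 0.0
            for i in range(n):
                pred_wo_j = sum(X[i][k] * w[k] for k in range(p) if k != j)
                rho += X[i][j] * (y[i] - pred_wo_j)
            rho /= n
            w[j] = soft_threshold(rho, alpha) / col_sq[j]
    return w

# Hypothetical subgraph descriptors and their counts in six toy molecules
names = ["C-C", "C-O", "C-C-O", "C=O"]
X = [
    [1.0, 0.0, 0.0, 0.0],
    [2.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [1.0, 1.0, 1.0, 0.0],
    [1.0, 0.0, 0.0, 1.0],
    [2.0, 1.0, 1.0, 0.0],
]
# Synthetic target values standing in for enthalpies of formation
y = [-5.0, -10.0, -30.0, -35.0, -30.0, -40.0]

weights = lasso_coordinate_descent(X, y, alpha=0.1)
# Descriptors with a nonzero weight are the ones LASSO "selected"
selected = [name for name, v in zip(names, weights) if v != 0.0]
print("weights:", dict(zip(names, weights)))
print("selected subgraphs:", selected)
```

Raising `alpha` drives more weights to exactly zero, which is what lets the regression double as a descriptor selector; in practice the penalty would typically be chosen by cross-validation rather than fixed by hand as it is here.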