(273b) Estimation of Thermodynamic Properties of Polycyclic Molecules By a Linear Regression Model

Authors: 
Li, Y. P., MIT
Han, K., MIT
Green, W. H., Massachusetts Institute of Technology
Although recent advances in ab initio methods have made accurate calculations of molecular thermochemistry possible, large-scale theoretical studies are still often limited, at least initially, to empirical methods that rapidly screen out unimportant species, so that only the important species are subjected to CPU-intensive quantum chemistry calculations. Among the empirical methods developed in past decades, the Benson group additivity scheme is one of the quickest and most convenient ways to determine the thermodynamic properties of molecules without requiring 3D molecular structures. It has been very successful in accurately predicting the thermochemistry of small molecules and has been widely adopted in modeling software for on-the-fly prediction of thermodynamic parameters. However, because the scheme calculates a molecular property simply by summing the independent contributions of individual chemical groups, the contribution of the overall molecular structure is usually not taken into account. This problem manifests itself in strained structures and can cause significant errors for polycyclic molecules with fused rings. Therefore, application of the additivity method is often restricted to simple chemical systems that contain no polycyclic species.
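
As a rough illustration of the additivity idea described above, the enthalpy of formation can be written as a count-weighted sum of group contributions. The symbols are assumptions introduced here for illustration: n_g is the number of times group g occurs in the molecule and h_g is the contribution assigned to that group.

```latex
\Delta_f H^{\circ}(\mathrm{molecule}) \;\approx\; \sum_{g} n_g \, h_g
```

Because every term depends only on a single group, nothing in this basic sum reflects how the groups are arranged into fused rings, which is the limitation noted above.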

To address this issue, we have developed an improved model to predict the thermochemistry of polycyclic species. A regularized linear model was chosen to retain the simplicity and interpretability of the additivity method. However, instead of following the definitions of the chemical groups in Benson’s scheme, we generated a comprehensive list of identifiers containing local (atoms, bonds, and angles) and/or nonlocal (rings) structural information and used these identifiers to compose feature vectors for molecules. Identifiers carrying nonessential information were eliminated by L1 regularization during model training, so that those remaining in the final model represent an optimal set of chemical units for calculating the thermodynamic property of interest, including the contribution of the cyclic structures. These chemical units are human-interpretable and conceptually equivalent to the chemical groups defined in Benson’s scheme, but they are selected objectively by the model without human intervention.
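
The way L1 regularization eliminates identifiers can be sketched with a small lasso fit. This is a minimal sketch under stated assumptions, not the authors' implementation: the solver (plain coordinate descent without an intercept), the function names `soft_threshold` and `fit_lasso`, the penalty `alpha`, and the toy data are all hypothetical. Each row of `X` stands in for a feature vector of identifier counts, and the third identifier is constructed to carry no information.

```python
# Minimal lasso sketch (coordinate descent, pure Python).
# All names and numbers are illustrative placeholders, not real thermochemistry.

def soft_threshold(value, threshold):
    """Shrink value toward zero; return exactly 0.0 inside [-threshold, threshold]."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def fit_lasso(X, y, alpha, n_sweeps=500):
    """Minimize (1/(2n)) * ||y - Xw||^2 + alpha * ||w||_1 with no intercept."""
    n, p = len(X), len(X[0])
    w = [0.0] * p
    for _ in range(n_sweeps):
        for j in range(p):
            rho = 0.0  # correlation of feature j with the partial residual
            z = 0.0    # squared norm of feature j
            for i in range(n):
                pred_without_j = sum(X[i][k] * w[k] for k in range(p) if k != j)
                rho += X[i][j] * (y[i] - pred_without_j)
                z += X[i][j] ** 2
            w[j] = soft_threshold(rho / n, alpha) / (z / n) if z > 0 else 0.0
    return w


# Toy feature vectors: counts of three hypothetical identifiers per molecule.
X = [
    [1, 0, 1],
    [0, 1, 1],
    [1, 1, 0],
    [2, 0, 1],
    [0, 2, 0],
    [1, 2, 1],
    [2, 1, 0],
    [0, 0, 1],
]
# Synthetic targets built as 2*x0 - 1*x1, so the third identifier is irrelevant.
y = [2.0, -1.0, 1.0, 4.0, -2.0, 0.0, 3.0, 0.0]

weights = fit_lasso(X, y, alpha=0.05)
selected = [j for j, w_j in enumerate(weights) if w_j != 0.0]
print("weights:", weights)
print("identifiers kept:", selected)
```

In this toy case the weight of the uninformative third identifier is driven to exactly zero, while the two informative weights are only slightly shrunk, so the surviving identifiers play the role of the selected chemical units.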

For a training set of 25,716 cyclic and polycyclic organic molecules made up of C, H, and O atoms, 408 identifiers were found to be needed for the calculation of formation enthalpies. The transferability of the trained model was validated on an independent test set of 2,858 molecules, for which the mean absolute error was 1.88 kcal/mol.
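
For reference, the mean absolute error quoted above is the average magnitude of the difference between reference and predicted values over the test set. The sketch below shows the calculation only; the function name and the three placeholder values are hypothetical and are not taken from the study.

```python
def mean_absolute_error(reference, predicted):
    """Average of |reference - predicted| over paired values."""
    if len(reference) != len(predicted) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(r - p) for r, p in zip(reference, predicted)) / len(reference)


# Placeholder enthalpies in kcal/mol, for illustration only.
reference_kcal = [1.0, 2.0, 3.0]
predicted_kcal = [1.5, 2.0, 2.0]

mae = mean_absolute_error(reference_kcal, predicted_kcal)
print("MAE (kcal/mol):", mae)
```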