(622e) Sigma Profiles in Deep Learning: Towards a Universal Molecular Descriptor | AIChE

Authors 

Zhang, Y., University of Notre Dame
Maginn, E., University of Notre Dame
The ability of deep neural networks (DNNs) to correlate variables whose relationship is unknown or too complex to be derived is attracting a great deal of interest in chemistry-related fields. However, most of the commonly used DNN-compatible molecular representations, namely string-based vectors (SMILES or SELFIES), molecular fingerprints, molecular graphs, and Coulomb matrices, have several shortcomings. In particular, they encode little chemical information beyond atom type and connectivity, and their size depends on the size of the molecule: the larger the molecule, the larger the vector or matrix used to represent it. Thus, the input size of a DNN must be made as large as the largest molecule in the dataset of interest. This leads to complex DNNs that have many trainable parameters and therefore need very large datasets to be properly fitted.
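The size dependence described above can be illustrated with the Coulomb matrix. The following is a minimal sketch, assuming the standard definition (diagonal 0.5·Z^2.4, off-diagonal Z_i·Z_j/|R_i − R_j|); the geometries are rough illustrative values and the helper names `coulomb_matrix` and `pad_to` are hypothetical, not taken from this work.

```python
import numpy as np

def coulomb_matrix(charges, coords):
    """Coulomb matrix: 0.5*Z^2.4 on the diagonal, Z_i*Z_j/|R_i-R_j| elsewhere."""
    z = np.asarray(charges, dtype=float)
    r = np.asarray(coords, dtype=float)
    # Pairwise distances between all atoms
    dist = np.linalg.norm(r[:, None, :] - r[None, :, :], axis=-1)
    np.fill_diagonal(dist, 1.0)  # placeholder to avoid division by zero
    m = np.outer(z, z) / dist
    m[np.diag_indices(len(z))] = 0.5 * z ** 2.4
    return m

def pad_to(matrix, size):
    """Zero-pad a square matrix so every molecule matches the largest one."""
    n = matrix.shape[0]
    if n > size:
        raise ValueError("matrix is larger than the requested size")
    out = np.zeros((size, size))
    out[:n, :n] = matrix
    return out

# Illustrative (approximate) geometries, for demonstration only
h2 = coulomb_matrix([1, 1], [[0.0, 0.0, 0.0], [0.74, 0.0, 0.0]])
water = coulomb_matrix(
    [8, 1, 1],
    [[0.0, 0.0, 0.0], [0.96, 0.0, 0.0], [-0.24, 0.93, 0.0]],
)

print(h2.shape, water.shape)  # the descriptor grows with the number of atoms

# A fixed-input network needs every descriptor padded to the largest molecule
h2_padded = pad_to(h2, water.shape[0])
print(h2_padded.shape)
```

In a real dataset the padding target is set by the largest molecule, so most inputs would consist largely of zeros.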

Because they are unnormalized histograms of screened charges, σ-profiles encode a great deal of chemical information (charge density, polarity, etc.), and their size does not change with the size of the molecule, which mitigates the disadvantages of the descriptors mentioned above. This work showcases, for the first time, the ability of σ-profiles to function as universal molecular descriptors in deep learning. To do so, the σ-profiles of 1432 compounds were used to train convolutional neural networks (CNNs) that accurately correlate and predict a wide range of physicochemical properties (molar masses, normal boiling temperatures, vapor pressures, densities, refractive indices, and aqueous solubilities). To boost their performance, the architecture and hyperparameters of each CNN were optimized using several algorithms, notably Bayesian optimization and local search. Furthermore, it was shown that thermodynamic conditions, namely temperature, can be used as additional inputs to broaden the applicability of the models. Beyond the advantages mentioned above, this work shows that σ-profiles can extend deep learning methodologies to areas where data are scarce and datasets are relatively small.
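The fixed-size property, and the idea of feeding temperature alongside the σ-profile, can be sketched as follows. This is a minimal illustration under stated assumptions, not the architecture used in this work: the 51-bin grid from −0.025 to 0.025 e/Å² is an assumed discretization, the surface segments are random placeholder data, the weights are untrained, and the names `sigma_profile` and `tiny_cnn_forward` are hypothetical.

```python
import numpy as np

# Assumed grid: 51 bins centred on -0.025 ... 0.025 e/Å^2 in steps of 0.001
SIGMA_EDGES = np.linspace(-0.0255, 0.0255, 52)

def sigma_profile(segment_sigma, segment_area):
    """Unnormalized histogram: total surface area per screening-charge bin."""
    profile, _ = np.histogram(segment_sigma, bins=SIGMA_EDGES, weights=segment_area)
    return profile

def tiny_cnn_forward(profile, temperature, kernel, w_out, b_out):
    """One 1D convolution with ReLU, then temperature appended before a linear layer."""
    # np.convolve flips the kernel (true convolution); for learned weights this
    # is equivalent to the cross-correlation used by deep learning frameworks.
    features = np.maximum(np.convolve(profile, kernel, mode="valid"), 0.0)
    dense_in = np.concatenate([features, [temperature]])
    return float(dense_in @ w_out + b_out)

rng = np.random.default_rng(0)

# Placeholder molecules with very different numbers of surface segments
small_sigma = rng.uniform(-0.02, 0.02, 120)
small_area = rng.uniform(0.1, 0.5, 120)
large_sigma = rng.uniform(-0.02, 0.02, 900)
large_area = rng.uniform(0.1, 0.5, 900)

small = sigma_profile(small_sigma, small_area)
large = sigma_profile(large_sigma, large_area)
print(small.shape, large.shape)  # same input length for both molecules

# Untrained, randomly initialised weights, for shape checking only
kernel = rng.normal(size=5)
n_features = small.size - kernel.size + 1  # length of the "valid" convolution
w_out = rng.normal(size=n_features + 1)    # +1 for the temperature input
prediction = tiny_cnn_forward(small, 298.15, kernel, w_out, 0.0)
```

Because both profiles have the same length, one set of network weights serves molecules of any size, and adding a thermodynamic condition costs only one extra input to the dense layer.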