(95g) A Compressed Sensing Framework for Learning Interpretable Molecular Property Models from Limited Data: Application to Discovery of Sustainable Battery Materials
2023 AIChE Annual Meeting
Computing and Systems Technology Division
Advances in machine learning and intelligent systems I
Sunday, November 5, 2023 - 5:36pm to 5:57pm
We distinguish the traditional black-box data-driven modeling paradigm from methods designed with human interpretability in mind, which we refer to as symbolic and interpretable models (SIMs). SIMs define a property as an algebraic model that depicts the contributions of, and competition among, physically meaningful quantities, similar to dimensionless numbers. Such methods build low-dimensional terms from a high-dimensional feature space and an operator set by finding the feature and operator combinations that correlate most strongly with observations. Training such models typically requires combinatorial screening to identify the terms with the highest correlation, which results in a computationally challenging problem. Although learning SIMs can be challenging, the resulting models have several properties that justify the approach. In particular, when the feature space consists of physically meaningful descriptors, the resulting model can describe the underlying physics, which yields the following benefits: (i) learning a model requires fewer data points than purely data-driven models do; (ii) the learned models can provide physical insights into the investigated phenomena; (iii) the learned models can generalize better than black-box models; and (iv) the learned models provide an efficient "latent representation" for use in non-parametric modeling and optimization frameworks.
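The screening idea described above can be illustrated with a minimal sketch. It is not the implementation used in this work: it assumes a toy operator set (+, -, *, /) applied once to pairs of primary features, uses synthetic data, and ranks candidate terms by absolute Pearson correlation with the observations. The function names `build_candidates` and `screen_by_correlation` and the feature names `x0` to `x3` are hypothetical.

```python
import itertools
import numpy as np

def build_candidates(X, names):
    """Combine primary features pairwise with a small, assumed operator set."""
    feats = [X[:, j] for j in range(X.shape[1])]
    labels = list(names)
    for i, j in itertools.combinations(range(X.shape[1]), 2):
        a, b = X[:, i], X[:, j]
        feats += [a + b, a - b, a * b, a / b]
        labels += [
            f"({names[i]}+{names[j]})",
            f"({names[i]}-{names[j]})",
            f"({names[i]}*{names[j]})",
            f"({names[i]}/{names[j]})",
        ]
    return np.column_stack(feats), labels

def screen_by_correlation(F, y, top_k):
    """Rank candidate terms by |Pearson correlation| with the observations."""
    Fc = F - F.mean(axis=0)
    yc = y - y.mean()
    corr = np.abs(Fc.T @ yc) / (np.linalg.norm(Fc, axis=0) * np.linalg.norm(yc))
    order = np.argsort(corr)[::-1][:top_k]
    return order, corr[order]

# Synthetic stand-in data (not from the abstract): 115 samples, 4 primary features.
rng = np.random.default_rng(0)
X = rng.uniform(1.0, 2.0, size=(115, 4))  # strictly positive, so division is safe
names = ["x0", "x1", "x2", "x3"]
y = 3.0 * X[:, 0] / X[:, 2] + 0.01 * rng.normal(size=115)  # hidden ratio law

F, labels = build_candidates(X, names)
order, scores = screen_by_correlation(F, y, top_k=3)
top_terms = [labels[k] for k in order]
print(top_terms, np.round(scores, 3))
```

Even this toy case expands 4 primary features into 28 candidate terms. Applying the operator set recursively to a realistic descriptor set grows the candidate space combinatorially, which is why the screening step dominates the computational cost.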
This work demonstrates a systematic framework for the data-driven learning of SIMs. Specifically, we use the sure independence screening and sparsifying operator (SISSO) [7] to identify property descriptors from high-dimensional Quantitative Structure-Property Relationship (QSPR) feature vectors [8]. Built upon many years of scientific research, QSPR descriptors are a collection of molecular structural properties that have been deemed pertinent to modeling other chemical properties. They therefore provide the physically meaningful molecular features needed to learn novel molecular property equations. First, we show that with just 115 molecules we can learn an accurate model of reduction potential that generalizes beyond the class of molecules used in training and accurately predicts the reduction potential of over 100,000 held-out molecules. Second, we build a solubility model and predict both the redox potentials and the solubilities of over 600,000 organic molecules. This large set of predictions is used to identify a Pareto front for the two battery-relevant properties, from which we select several synthesizable molecules to test as novel organic electrode materials. Initial testing of these novel organic electrode batteries shows energy densities and cycle-based degradation rates comparable to those of current state-of-the-art organic electrode batteries.
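As an illustration of the Pareto-front step, the following sketch extracts the non-dominated points for two predicted properties. It is an assumption-laden example, not the procedure used in this work: both objectives are assumed to be oriented so that larger is better (negate a column to minimize it), the helper name `pareto_front_2d` is hypothetical, and the five data points are made up. A sort-and-sweep approach is used because a pairwise comparison would scale quadratically with the number of molecules.

```python
import numpy as np

def pareto_front_2d(a, b):
    """Indices of non-dominated points when both a and b are maximized.

    Exact duplicates of a front point are dropped by the strict comparison.
    """
    # Sort by a descending; break ties by b descending (last key is primary).
    order = np.lexsort((-b, -a))
    best_b = -np.inf
    front = []
    for idx in order:
        # Every earlier point has a >= a[idx], so idx survives only if it
        # improves on the best b seen so far.
        if b[idx] > best_b:
            front.append(int(idx))
            best_b = b[idx]
    return np.array(front, dtype=int)

# Made-up predictions for five hypothetical molecules.
prop_a = np.array([1.0, 2.0, 3.0, 2.5, 0.5])
prop_b = np.array([3.0, 2.0, 1.0, 0.5, 2.5])

front = pareto_front_2d(prop_a, prop_b)
print(sorted(front.tolist()))
```

The sweep costs O(n log n) for the sort plus a single pass, which remains practical for several hundred thousand candidates.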
References:
[1] Vermeire, F. H.; Chung, Y.; Green, W. H. Predicting Solubility Limits of Organic Solutes for a Wide Range of Solvents and Temperatures. J. Am. Chem. Soc. 2022, 10785–10797.
[2] Jin, W.; Barzilay, R.; Jaakkola, T. Junction Tree Variational Autoencoder for Molecular Graph Generation. 2019.
[3] Svetnik, V.; Liaw, A.; Tong, C.; Culberson, J. C.; Sheridan, R. P.; Feuston, B. P. Random Forest: A Classification and Regression Tool for Compound Classification and QSAR Modeling. Journal of Chemical Information and Computer Sciences 2003, 43, 1947–1958, PMID: 14632445.
[4] Chen, C.-H.; Tanaka, K.; Funatsu, K. Random Forest Model with Combined Features: A Practical Approach to Predict Liquid-crystalline Property. Molecular Informatics 2019, 38, 1800095.
[5] Deringer, V. L.; Bartók, A. P.; Bernstein, N.; Wilkins, D. M.; Ceriotti, M.; Csányi, G. Gaussian Process Regression for Materials and Molecules. Chemical Reviews 2021, 121, 10073–10141, PMID: 34398616.
[6] Jorissen, R. N.; Gilson, M. K. Virtual Screening of Molecular Databases Using a Support Vector Machine. Journal of Chemical Information and Modeling 2005, 45, 549–561, PMID: 15921445.
[7] Ouyang, R.; Curtarolo, S.; Ahmetcik, E.; Scheffler, M.; Ghiringhelli, L. M. SISSO: A compressed-sensing method for identifying the best low-dimensional descriptor in an immensity of offered candidates. Phys. Rev. Mater. 2018, 2, 083802.
[8] Moriwaki, H.; Tian, Y.; Kawashita, N. Mordred: a molecular descriptor calculator. J Cheminform 2018, 10.