(346au) Machine Learning Using the Guest/Host Energy Histogram to Predict the Adsorption of Chain Molecules | AIChE

Authors 

Li, Z. - Presenter, Northwestern University
Bucior, B., Northwestern University
Snurr, R., Northwestern University
Machine learning is increasingly useful in computational research and has been applied successfully to nanoporous materials such as zeolites and metal-organic frameworks (MOFs). Recently, Bucior et al. (1) used histograms of the guest/host energy as features for LASSO, a linear regression machine learning method, to predict the top candidates for hydrogen and methane storage in MOFs. In this work, we refined the energy histogram methodology and extended it to longer alkanes such as ethane, propane, and n-hexane. Bins are now generated automatically rather than with predetermined sizes, so that an energy range with more counts is represented by more bins in the histogram. To test the new methodology, we performed grand canonical Monte Carlo (GCMC) simulations for materials from the ToBaCCo set of MOF structures (2). We trained both linear and nonlinear models, including LASSO (3), random forest (RF) (4), and neural networks (NN) (5), to predict adsorption. We find that the size of the molecule strongly influences which ML method works best: for ethane, simple linear regression (LASSO) works well, but for propane and n-hexane nonlinear methods (RF and NN) are needed.
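The idea of giving densely populated energy ranges more bins can be illustrated with equal-count (quantile) binning. The sketch below is only one plausible reading of that description, not the algorithm used in this work: the function names `adaptive_bin_edges` and `histogram_features`, the choice of quantile edges, and the normalization to fractions are all assumptions made for illustration.

```python
from bisect import bisect_right


def adaptive_bin_edges(energies, n_bins):
    """Return n_bins + 1 edges at equally spaced quantiles of the energies.

    Assumption: equal-count binning stands in for the adaptive scheme, so
    energy ranges holding many grid points receive narrower (more) bins.
    """
    if n_bins < 1:
        raise ValueError("n_bins must be at least 1")
    values = sorted(energies)
    if not values:
        raise ValueError("energies must not be empty")
    last = len(values) - 1
    edges = []
    for i in range(n_bins + 1):
        pos = i * last / n_bins          # fractional index of the i-th quantile
        lo = int(pos)
        hi = min(lo + 1, last)
        frac = pos - lo
        # Linear interpolation between the two neighbouring sorted values.
        edges.append(values[lo] + frac * (values[hi] - values[lo]))
    return edges


def histogram_features(energies, edges):
    """Return the fraction of energy values that falls into each bin."""
    if not energies:
        raise ValueError("energies must not be empty")
    counts = [0] * (len(edges) - 1)
    for e in energies:
        idx = bisect_right(edges, e) - 1
        idx = min(max(idx, 0), len(counts) - 1)  # clamp values on or beyond the outer edges
        counts[idx] += 1
    total = len(energies)
    return [c / total for c in counts]
```

In a screening workflow, the edges would presumably be computed once from energies pooled over a training set, and each material's energy grid would then be flattened and passed to `histogram_features` to obtain a fixed-length feature vector for LASSO, RF, or NN models.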

For n-hexane, we found a large number of outliers in the parity plot of ML versus GCMC. These outliers correspond to points on the adsorption isotherm near the point of condensation in the pores. The ML predictions revealed that the GCMC simulations at these points had not converged, so the error was in the simulation and not in the ML model. Such unconverged points are nearly impossible to identify from the error bars generated by GCMC simulations because the system remains trapped in one state and appears converged. Machine learning thus offers a way to improve the quality of high-throughput screening by molecular simulation at these critical conditions.
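One simple way to pick such parity-plot outliers out of a large screening set is a robust residual test. The sketch below is a generic screen based on the median absolute deviation (MAD), offered as an assumption rather than the procedure used in this work; the function name `flag_parity_outliers` and the default cutoff of 3.5 are illustrative choices.

```python
from statistics import median


def flag_parity_outliers(ml_pred, gcmc, threshold=3.5):
    """Return indices of points whose ML-minus-GCMC residual is a robust outlier.

    Assumption: residuals are scored by their distance from the median in
    units of 1.4826 * MAD (a robust stand-in for the standard deviation).
    """
    if len(ml_pred) != len(gcmc):
        raise ValueError("ml_pred and gcmc must have the same length")
    if not ml_pred:
        return []
    residuals = [p - s for p, s in zip(ml_pred, gcmc)]
    center = median(residuals)
    mad = median([abs(r - center) for r in residuals])
    if mad == 0:
        # More than half of the residuals coincide; flag any that differ.
        return [i for i, r in enumerate(residuals) if r != center]
    scale = 1.4826 * mad
    return [i for i, r in enumerate(residuals)
            if abs(r - center) / scale > threshold]
```

Points flagged this way are only candidates: a large residual can come from either the model or the simulation, so each flagged state point would still need to be rerun, for example with a longer GCMC run or a different starting configuration, before the simulation result is judged unconverged.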

We also tested additional new features to compensate for the information lost when one-dimensional histograms are constructed from the three-dimensional energy grids. Additionally, the new energy histogram algorithm proves robust when predicting selectivity for Xe/Kr separation.


References:

(1) Bucior, B. J.; Bobbitt, N. S.; Islamoglu, T.; Goswami, S.; Gopalan, A.; Yildirim, T.; Farha, O. K.; Bagheri, N.; Snurr, R. Q. Energy-Based Descriptors to Rapidly Predict Hydrogen Storage in Metal–Organic Frameworks. Mol. Syst. Des. Eng. 2019, 4 (1), 162–174. https://doi.org/10.1039/C8ME00050F.

(2) Colón, Y. J.; Gómez-Gualdrón, D. A.; Snurr, R. Q. Topologically Guided, Automated Construction of Metal–Organic Frameworks and Their Evaluation for Energy-Related Applications. Cryst. Growth Des. 2017, 17 (11), 5801–5810. https://doi.org/10.1021/acs.cgd.7b00848.

(3) Tibshirani, R. Regression Shrinkage and Selection via the Lasso. J. R. Stat. Soc. Ser. B Methodol. 1996, 58 (1), 267–288.

(4) Liaw, A.; Wiener, M. Classification and Regression by randomForest. R News 2002, 2 (3), 18–22.

(5) Abadi, M.; Barham, P.; Chen, J.; Chen, Z.; Davis, A.; Dean, J.; Devin, M.; Ghemawat, S.; Irving, G.; Isard, M.; Kudlur, M.; Levenberg, J.; Monga, R.; Moore, S.; Murray, D. G.; Steiner, B.; Tucker, P.; Vasudevan, V.; Warden, P.; Wicke, M.; Yu, Y.; Zheng, X. TensorFlow: A System for Large-Scale Machine Learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16); 2016; pp 265–283.