(420e) Active Learning for Data-Efficient Training of Machine Learning Models to Predict Adsorption in Metal-Organic Frameworks (MOFs). | AIChE

Authors 

Osaro, E. - Presenter, University of Notre Dame
Fajardo-Rojas, F., Colorado School of Mines
Gomez Gualdron, D., Colorado School of Mines
Metal-organic frameworks (MOFs) are a promising class of porous, crystalline materials for numerous applications. For instance, a MOF with the “right” adsorption properties could enable replacing a given thermal-based chemical separation process with an adsorption-based one, which could in turn bring about a 10-fold increase in energy efficiency. As chemical separations account for roughly 15% of U.S. energy usage, and about 80% of these separations are done thermally, finding the “right” MOF for each separation could potentially reduce U.S. energy expenditure by around 11%. Given i) the overwhelmingly large MOF “design space” (with trillions of potential designs), and ii) the thousands of chemical separations, each of which could potentially be performed at a variety of operating conditions (OCs; e.g., temperature, pressure, relative proportion of components), computation will clearly play a central role in identifying, for each chemical separation, the most promising MOF and its corresponding “optimal” operating condition.
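
The 11% figure is consistent with the following back-of-the-envelope arithmetic. This is one plausible reading of the numbers quoted above, not a derivation given by the authors; it assumes that a 10-fold efficiency gain removes nine tenths of the energy used by each thermal separation.

```latex
\[
\underbrace{0.15}_{\text{separations share of U.S. energy}}
\times
\underbrace{0.80}_{\text{thermal fraction}}
\times
\underbrace{\left(1 - \tfrac{1}{10}\right)}_{\text{saving from a 10-fold gain}}
= 0.108 \approx 11\%
\]
```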

The challenge is that classical simulation methods to predict adsorption (e.g., grand canonical Monte Carlo (GCMC)) are just “fast enough” to make thousands to hundreds of thousands of adsorption predictions in a reasonable timeframe. However, finding the optimal MOF-OC combination for each chemical separation of interest is a task that would probably entail trillions of adsorption predictions. Thus, faster methods such as machine learning (ML) are better poised to take on such a task. In earlier work, some of us demonstrated the ability of multilayer perceptron (MLP) models to learn to predict adsorption at multiple conditions and for multiple molecules when provided with GCMC-generated training data for adsorption of different molecules at different pressures in different MOFs. That demonstration was limited to near-spherical, non-polar molecules, and extension to a wider class of molecules requires increasing the diversity and size of the training data. At the same time, because of the computational resources needed to generate training data and to train the ML model, there is a critical need to keep the training dataset as small as possible.

Active learning (AL) can play a very important role in efficiently and “smartly” navigating the “adsorption space” to limit the burden of data generation while enabling the training of highly predictive ML models. In this work, we first establish the implementation of a Gaussian process regression (GPR) framework to model pure-component adsorption of nitrogen at 77 K from 10⁻⁵ to 1 bar, methane at 298 K from 10⁻⁵ to 100 bar, carbon dioxide at 298 K from 10⁻⁵ to 100 bar, and hydrogen at 77 K from 10⁻⁵ to 100 bar in a diverse set of eleven MOFs. In this GPR framework, a first model is trained with an initial data set known as the “prior.” Subsequent models are then retrained as adsorption data are added to the dataset, with the data to add chosen according to the uncertainty of the GP model evaluated on a new data set. Here, we test three different “prior” selection schemes and recommend the best prior selection scheme for the 44 adsorbate-adsorbent pairs. The recommendation is based primarily on the mean absolute error and the total number of data points required for the ML model's predictions to converge.
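
The uncertainty-driven loop described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the Gaussian process is written out with NumPy using a fixed squared-exponential kernel, `toy_isotherm` is a hypothetical Langmuir-like stand-in for a GCMC calculation, and the three-point prior, the stopping tolerance `tol`, and all function names are illustrative choices.

```python
import numpy as np

def rbf_kernel(a, b, length_scale=0.2):
    """Squared-exponential kernel (unit prior variance) between 1-D arrays."""
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / length_scale) ** 2)

def gp_predict(x_train, y_train, x_query, noise=1e-6):
    """Zero-mean GP posterior mean and standard deviation at x_query."""
    K = rbf_kernel(x_train, x_train) + noise * np.eye(len(x_train))
    K_s = rbf_kernel(x_train, x_query)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))
    mean = K_s.T @ alpha
    v = np.linalg.solve(L, K_s)
    var = 1.0 - np.sum(v ** 2, axis=0)  # prior variance is 1 for this kernel
    return mean, np.sqrt(np.clip(var, 0.0, None))

def toy_isotherm(x):
    """Hypothetical stand-in for GCMC: fractional Langmuir loading.

    x in [0, 1] maps to log10(pressure / bar) in [-5, 0]; the Langmuir
    constant of 100 bar^-1 is an arbitrary illustrative value.
    """
    p = 10.0 ** (-5.0 + 5.0 * x)
    return 100.0 * p / (1.0 + 100.0 * p)

def active_learning(pool_x, oracle, prior_idx, tol=0.02, max_iter=30):
    """Grow the training set by querying the most uncertain pool point."""
    train_idx = list(prior_idx)
    y_train = list(oracle(pool_x[train_idx]))
    for _ in range(max_iter):
        _, std = gp_predict(pool_x[train_idx], np.array(y_train), pool_x)
        pick = int(np.argmax(std))
        if std[pick] < tol:  # stop once the GP is confident everywhere
            break
        train_idx.append(pick)
        y_train.append(float(oracle(pool_x[pick])))  # one new "simulation"
    mean, _ = gp_predict(pool_x[train_idx], np.array(y_train), pool_x)
    return train_idx, mean

# Candidate pressures on a scaled log grid; prior = both ends and the middle.
pool_x = np.linspace(0.0, 1.0, 51)
train_idx, mean = active_learning(pool_x, toy_isotherm, prior_idx=[0, 25, 50])
mae = float(np.mean(np.abs(mean - toy_isotherm(pool_x))))
```

Here `len(train_idx)` counts the oracle calls actually made and `mae` is the mean absolute error over the whole pool, mirroring the two criteria the abstract uses to compare prior selection schemes.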

Having established the GPR framework, we demonstrate the application of the methodology to alchemical molecules. These hypothetical species can be characterized by two main kinds of features: intermolecular potential parameters (e.g., the well depth and the distance at which the intermolecular potential between two particles is zero), and intramolecular properties, such as bond lengths and charges.
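
To make the two intermolecular parameters concrete, the sketch below evaluates a pair potential for one site of a hypothetical alchemical species. It assumes the common Lennard-Jones 12-6 form, which the abstract does not name; the `AlchemicalSite` class, its field names, and the numerical values are illustrative only.

```python
from dataclasses import dataclass

@dataclass
class AlchemicalSite:
    """Hypothetical descriptor of one site of an alchemical species."""
    epsilon: float  # well depth (e.g., in K)
    sigma: float    # distance at which the pair potential is zero (angstrom)
    charge: float   # partial charge (e); not used by the function below

def lennard_jones(r, site):
    """Lennard-Jones 12-6 energy between two identical sites at distance r."""
    sr6 = (site.sigma / r) ** 6
    return 4.0 * site.epsilon * (sr6 * sr6 - sr6)

# Illustrative parameter values, not taken from the abstract.
site = AlchemicalSite(epsilon=120.0, sigma=3.4, charge=0.0)
u_at_sigma = lennard_jones(site.sigma, site)                 # zero crossing
u_at_minimum = lennard_jones(2 ** (1 / 6) * site.sigma, site)  # well bottom
```

In this form the potential crosses zero at `r = sigma` and reaches its minimum of `-epsilon` at `r = 2**(1/6) * sigma`, which is why these two numbers suffice to specify the dispersion interaction of a site.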

A previously developed MLP model, trained on approximately 5 million GCMC data points obtained from 1800 topologically and chemically diverse ToBaCCo-generated MOFs using several single- and multiple-site alchemical species at different fugacities, has advanced adsorption studies by providing accurate results for a diverse set of real molecules. Using the established AL framework, with this MLP model as a substitute for GCMC, we show that we can build accurate GPR models that predict the isotherms of all alchemical species across these 1800 diverse MOFs, as evaluated on a separate test data set spanning fugacity and alchemical parameters. Our results show a 57.5% saving in data, indicating that only around 2.2 million simulations are needed to train a new MLP model for adsorption.
