(186i) Predicting Catalytic Activity of Heteropoly Acid Using Machine Learning Model with Small Dataset | AIChE

Haider, M. A., Department of Chemical Eng., IIT Delhi
Khan, T. S., Indian Institute of Technology Delhi
Deprotonation enthalpy (DPE) is a decisive descriptor for theoretically assessing the ease of proton removal from acid catalysts, facilitating quick catalyst screening. Machine Learning (ML) models have become popular in materials science as a way to reduce heavy computational dependence: a well-trained model can learn the trends in catalyst properties and thereby fast-track catalyst selection. DPE accurately explains the activity of the Keggin-type (HnM12XO40) heteropoly acids (HPAs), a class of acid catalysts. Accordingly, the dataset for the model was developed on HPAs, with periodic-table properties of their central (X) and addenda (M) atoms as feature variables and DPE values obtained from Density Functional Theory (DFT) calculations as the target. Root mean squared (RMS) error and the variance score were used as evaluation metrics to develop a reliable ML model with minimal bias and high accuracy.

The Gradient Boosting Regression (GBR) algorithm returned an exceptional fit, with a high variance score of 0.88 on the test set and 0.97 on the training set. This supervised learning model showed the lowest error among the popular linear and ensemble models examined: it predicted DPE with an RMS error of just 5.6 kJ/mol (~0.05 eV) on test data, averaged over 100 random train-test splits. Notably, the GBR algorithm was trained on a small dataset, since obtaining massive data on a specific class of catalysts is an arduous task. The electronic and elemental periodic-table properties selected as feature inputs exhibit an intrinsic trend with the target descriptor (i.e., DPE), ensuring good predictions even on a small dataset. Furthermore, the algorithm sequentially fits the residual error between true and predicted values while developing the model, yielding an ensemble of models that together provide an excellent fit to the dataset.
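The evaluation protocol described above (a GBR model scored by RMS error averaged over 100 random train-test splits) can be sketched with scikit-learn. The feature columns and the synthetic data below are illustrative assumptions, not the study's actual dataset; the real features are periodic-table properties of the X and M atoms and the real targets are DFT-computed DPE values.

```python
# Minimal sketch of the reported workflow: train GradientBoostingRegressor on a
# small dataset and average the test RMS error over 100 random train-test splits.
# All data here is synthetic (a stand-in for DFT-computed DPE values in kJ/mol).
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_samples = 40  # deliberately small, mimicking the limited HPA dataset

# Illustrative feature matrix: periodic-table properties (e.g., electronegativity,
# atomic radius) of the central (X) and addenda (M) atoms, standardized.
X = rng.normal(size=(n_samples, 6))
# Synthetic DPE target: smooth function of the features plus small noise.
y = 1100.0 + 30.0 * X[:, 0] - 20.0 * X[:, 1] + 10.0 * X[:, 2] * X[:, 3] \
    + rng.normal(scale=3.0, size=n_samples)

rmse_scores = []
for seed in range(100):  # average over 100 random train-test splits
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.2, random_state=seed
    )
    model = GradientBoostingRegressor(random_state=seed).fit(X_tr, y_tr)
    y_pred = model.predict(X_te)
    rmse_scores.append(np.sqrt(mean_squared_error(y_te, y_pred)))

mean_rmse = float(np.mean(rmse_scores))
print(f"Mean test RMSE over 100 splits: {mean_rmse:.1f} kJ/mol")
```

Averaging over many random splits, rather than reporting a single split, is what makes the error estimate trustworthy on a small dataset, where one lucky or unlucky partition can swing the score considerably.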