(152c) Automatic Creation of Molecular Substructure Descriptors for Estimation of Pure Compound Properties | AIChE

Authors 

Li, J., The University of Manchester
Fan, X., The University of Manchester
With the rapid development of the chemical industries, and especially advances in chemical synthesis technology, the number of known chemical compounds has increased substantially.1 Understanding chemical properties is an important prerequisite for the effective utilization of chemical compounds.2 In general, compound properties such as normal boiling point and Reid vapour pressure are obtained through experiments. Although experiment is the most reliable approach, measuring a large number of compound properties is time-consuming and expensive.3,4 In addition, compound properties can change when pressure or temperature changes. As a result, it is imperative to develop suitable mathematical models for predicting compound properties from the experimental data available.

The earliest method for property prediction is the use of simple correlations, for instance a correlation of the property value with the number of carbon atoms.5,6 However, this simple correlation method is only suitable for a single family of compounds or for certain common organic compounds.7 The quantitative structure activity/property relationship (QSAR/QSPR) approach has been widely used for property estimation of chemical compounds.2,8-10 It converts physical and chemical information about a molecule into molecular descriptors and correlates chemical properties with the numerical values of those descriptors. However, descriptors are generated according to various distinct rules, so the interpretation of some descriptors is provisional and weak.11,12 The group contribution (GC) method breaks a molecule into commonly occurring groups and then develops correlations for property estimation based on the contribution of each group to the target property. However, predictions can be unreliable when the predefined groups cannot represent the structural building blocks of the molecule. The connectivity index (CI) method,13 in which a molecule is represented by valence connectivity indices,14 was then introduced to create missing groups automatically. However, the CI method is limited by the number of atoms and connectivity indices available.

Despite the drawbacks mentioned above, the successful application of the GC and CI methods suggests that molecular substructure information can be effective for property estimation. The QSPR method, meanwhile, generates descriptors automatically, which means it can be easily adapted to the compounds to which it is applied. Therefore, an approach that may improve prediction should be able to extract the structural characteristics of molecules efficiently and automatically. In other words, the generation and selection of groups should be driven by data analysis and interpretation rather than by intuition or personal experience, which is in line with the recently popular “data-driven” concept.

In this work, we develop a novel framework that can be divided into three parts: matrix representation of molecular connectivity information, submatrix feature generation, and regression model development. First, a molecule is represented by a SMILES code, which can be generated directly using the ChemDraw software15 if the chemical structure is known. Based on the popular cheminformatics tool RDKit, we develop a program to automatically extract atom and bond connection information from the canonical SMILES code of a molecule. Using the canonical SMILES representation, each atom in the molecule is identified and labelled sequentially with a number, called the ID of the atom. An N×N square matrix, denoted M = (a_ij)_(N×N), where N is the total number of atoms in the molecule, is then constructed to store the connectivity information of the molecule. The diagonal entry (i, i) stores the atomic number of the atom whose ID is i. An off-diagonal entry (i, j) stores the bonding information between atom i and atom j. If there is no bond connecting atom i and atom j, the entries (i, j) and (j, i) are set to 0; otherwise they are set to 1 for a single bond, 2 for a double bond, 1.5 for an aromatic bond and 3 for a triple bond. In this way, all structural information of a molecule is stored in this matrix. We can therefore extract the environment of each atom in the molecule (i.e. its own information and its connectivity with other atoms). The environment of an atom can itself be represented by a matrix, which is obtained from the connectivity and is hence a submatrix of M. Each environment of an atom can be considered a feature of that atom. The environment, or feature, of an atom depends largely on the length of the chain of contiguous atoms considered around it, which can be 1, 2, 3, 4 or more.
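The matrix construction and the extraction of an atom environment described above can be sketched as follows. This is a minimal, dependency-free illustration and not the authors' program: the atom and bond lists are hard-coded here, whereas in practice they could be read from RDKit (for example via `Chem.MolFromSmiles`, `atom.GetAtomicNum()` and `bond.GetBondTypeAsDouble()`). The function names `connectivity_matrix` and `environment` are hypothetical, only heavy atoms are included, and an order-k environment is assumed to mean all atoms within k bonds of the centre atom; the abstract does not state these details.

```python
from collections import deque


def connectivity_matrix(atomic_numbers, bonds):
    """Build the N x N matrix M: atomic numbers on the diagonal,
    bond orders (1, 1.5, 2 or 3) off the diagonal, 0 where there is no bond."""
    n = len(atomic_numbers)
    m = [[0.0] * n for _ in range(n)]
    for i, z in enumerate(atomic_numbers):
        m[i][i] = float(z)
    for i, j, order in bonds:
        m[i][j] = m[j][i] = float(order)
    return m


def environment(m, centre, order):
    """Return (atom IDs, submatrix of M) for the atoms lying within
    `order` bonds of atom `centre` (assumed definition of an environment)."""
    dist = {centre: 0}
    queue = deque([centre])
    while queue:
        i = queue.popleft()
        if dist[i] == order:
            continue  # do not expand beyond the requested order
        for j, value in enumerate(m[i]):
            if j != i and value != 0 and j not in dist:
                dist[j] = dist[i] + 1
                queue.append(j)
    ids = sorted(dist)
    sub = [[m[a][b] for b in ids] for a in ids]
    return ids, sub


# Ethanol, heavy atoms only (SMILES "CCO"): ID 0 = C, ID 1 = C, ID 2 = O.
atoms = [6, 6, 8]
bonds = [(0, 1, 1.0), (1, 2, 1.0)]

M = connectivity_matrix(atoms, bonds)
ids1, sub1 = environment(M, centre=0, order=1)
ids2, sub2 = environment(M, centre=0, order=2)

print(M)
print(ids1, sub1)
print(ids2, sub2)
```

For the methyl carbon of ethanol, the 1st-order environment contains only the two carbon atoms, while the 2nd-order environment already covers the whole molecule.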
The chemical features generated for lengths 1, 2, 3 and 4 are called the 1st-order, 2nd-order, 3rd-order and 4th-order features, respectively, and are represented by submatrices of the corresponding order. Lastly, feature engineering methods such as feature selection and dimensionality reduction are applied, and the remaining chemical features are used to train and test an artificial neural network (ANN) model. Pearson correlation-based feature selection and principal component analysis (PCA) are used to clean the data before it is input to the ANN for regression. The ANN hyper-parameters are tuned in order to test the robustness of the proposed method.
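As an illustration of the feature selection step, the sketch below filters feature columns by their Pearson correlation with the target property. It rests on assumptions: the abstract does not say how the Pearson criterion is applied, so keeping features whose absolute correlation with the target reaches a threshold is only one plausible reading. The threshold, the toy feature counts and the toy target values are made up for the example, and the PCA step is omitted.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient; returns 0.0 for a constant column."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def select_features(X, y, threshold):
    """Return the indices of the columns of X with |r| >= threshold
    against the target y (assumed selection rule)."""
    keep = []
    for k in range(len(X[0])):
        column = [row[k] for row in X]
        if abs(pearson(column, y)) >= threshold:
            keep.append(k)
    return keep


# Hypothetical data: rows are molecules, columns are substructure feature counts.
X = [
    [1, 0, 2],
    [2, 0, 1],
    [3, 0, 2],
    [4, 0, 1],
]
y = [10.0, 20.0, 30.0, 40.0]  # made-up target values

kept = select_features(X, y, threshold=0.5)
print(kept)
```

Here the first column tracks the target exactly, the second is constant and carries no information, and the third is only weakly correlated, so only the first column is retained at this threshold.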

In the case study, we adopted the largest normal boiling point database, with 5276 molecules.4 All of the molecules were used to generate molecular substructure features. The features generated for three compounds, ortho-xylene, meta-xylene and para-xylene, illustrate that the three isomers cannot be differentiated using 1st-order features alone, whereas higher-order features can differentiate them. 15% of the data points were set aside as the test set. The remaining 85% were used to train the ANN model with K-fold cross-validation (K = 6 in this work). The six trained prediction models perform similarly, which indicates that the proposed method is reliable. The mean absolute percentage error (MAPE) is 4.09% on the test set. Alshehri et al.4 developed a Gaussian process regression (GPR) model on GC features. For comparison, we first input the GC features and the proposed submatrix features into the GPR model: the root mean squared error (RMSE) on the test set was 37.64 and 33.33, respectively, while on the training set and test set the resulting RMSE was 12.07 and 9.10, respectively. We then input both the GC features and the proposed submatrix features of the test set into the developed ANN model: the MAPE and RMSE for the submatrix features are 34.2177 and 4.774, respectively, which are lower than the 40.441 and 5.53 obtained for the GC features. The reduced RMSE indicates that the proposed submatrix feature method is accurate for predicting normal boiling points. Although the case study in this work is the estimation of normal boiling points, the method could be extended to the prediction of other chemical properties simply by replacing the input molecules and the target property.
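The xylene observation can be reproduced with a small sketch. To compare environments across molecules independently of atom numbering, it uses a recursive neighbourhood signature as a stand-in for the submatrix features; this substitution is an assumption, since the abstract does not describe how submatrices are matched between molecules. Only heavy atoms are considered, and the helper names `signature`, `feature_counts` and `xylene` are hypothetical.

```python
from collections import Counter


def signature(atoms, nbrs, i, depth):
    """Numbering-independent description of atom i and its surroundings
    up to `depth` bonds: (atomic number, sorted (bond order, neighbour signature))."""
    if depth == 0:
        return (atoms[i],)
    branches = sorted(
        (order, signature(atoms, nbrs, j, depth - 1)) for j, order in nbrs[i]
    )
    return (atoms[i], tuple(branches))


def feature_counts(atoms, bonds, depth):
    """Count how often each order-`depth` signature occurs in a molecule."""
    nbrs = {i: [] for i in range(len(atoms))}
    for i, j, order in bonds:
        nbrs[i].append((j, order))
        nbrs[j].append((i, order))
    return Counter(signature(atoms, nbrs, i, depth) for i in range(len(atoms)))


def xylene(second_methyl_position):
    """Heavy-atom graph of a xylene: ring atoms 0-5 (aromatic bonds, 1.5),
    methyl atom 6 on ring atom 0, methyl atom 7 on the given ring atom."""
    atoms = [6] * 8
    ring = [(k, (k + 1) % 6, 1.5) for k in range(6)]
    return atoms, ring + [(0, 6, 1.0), (second_methyl_position, 7, 1.0)]


isomers = {"ortho": xylene(1), "meta": xylene(2), "para": xylene(3)}
first = {name: feature_counts(a, b, 1) for name, (a, b) in isomers.items()}
second = {name: feature_counts(a, b, 2) for name, (a, b) in isomers.items()}

# 1st-order features: identical counts for all three isomers.
print(first["ortho"] == first["meta"] == first["para"])
# 2nd-order features: every pair of isomers differs.
print(second["ortho"] != second["meta"],
      second["meta"] != second["para"],
      second["ortho"] != second["para"])
```

At the 1st order each isomer has the same three kinds of carbon environment in the same numbers, whereas at the 2nd order the relative positions of the two methyl-bearing ring carbons become visible in their neighbours' signatures.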

References

1. Yoshida, J. i.; Kim, H.; Nagaki, A., Green and sustainable chemical synthesis using flow microreactors. ChemSusChem 2011, 4 (3), 331-340.
2. Katritzky, A. R.; Karelson, M.; Lobanov, V. S., QSPR as a means of predicting and understanding chemical and physical properties in terms of structure. Pure and Applied Chemistry 1997, 69 (2), 245-248.
3. Wen, H.; Su, Y.; Wang, Z.; Jin, S.; Ren, J.; Shen, W.; Eden, M., A systematic modeling methodology of deep neural network‐based structure‐property relationship for rapid and reliable prediction on flashpoints. AIChE J. 2021, e17402.
4. Alshehri, A. S.; Tula, A. K.; You, F.; Gani, R., Next generation pure component property estimation models: With and without machine learning techniques. AIChE J. 2021, e17469.
5. Korsten, H., Characterization of hydrocarbon systems by DBE concept. AIChE J. 1997, 43 (6), 1559-1568.
6. Fisher, C.; CH, F., Equations correlate n-alkane physical properties with chain length. 1982.
7. Van Nes, K.; Van Westen, H. A., Aspects of the constitution of mineral oils. Elsevier Publishing Company: 1951.
8. Vom Lehn, F.; Brosius, B.; Broda, R.; Cai, L.; Pitsch, H., Using machine learning with target-specific feature sets for structure-property relationship modeling of octane numbers and octane sensitivity. Fuel 2020, 281, 118772.
9. Shi, X.; Li, H.; Song, Z.; Zhang, X.; Liu, G., Quantitative composition-property relationship of aviation hydrocarbon fuel based on comprehensive two-dimensional gas chromatography with mass spectrometry and flame ionization detector. Fuel 2017, 200, 395-406.
10. Roubehie Fissa, M.; Lahiouel, Y.; Khaouane, L.; Hanini, S., QSPR estimation models of normal boiling point and relative liquid density of pure hydrocarbons using MLR and MLP-ANN methods. J. Mol. Graphics Modell. 2019, 87, 109-120.
11. Todeschini, R.; Consonni, V., Handbook of molecular descriptors. Wiley-VCH, Weinheim: 2000.
12. Dong, J.; Yao, Z. J.; Zhang, L.; Luo, F.; Lin, Q.; Lu, A. P.; Chen, A. F.; Cao, D. S., PyBioMed: a python library for various molecular representations of chemicals, proteins and DNAs and their interactions. J Cheminform 2018, 10 (1), 16.
13. Gani, R.; Harper, P. M.; Hostrup, M., Automatic creation of missing groups through connectivity index for pure-component property prediction. Industrial & Engineering Chemistry Research 2005, 44 (18), 7262-7269.
14. Kier, L. B.; Hall, L. H., Molecular connectivity in structure-activity analysis. Research studies: 1986.
15. Cousins, K. R., Computer review of ChemDraw ultra 12.0. ACS Publications: 2011.