(656e) Developing an ML Algorithm to Predict the Aqueous Solubility of Polymers and Organic Compounds | AIChE

Authors 

Alshami, A., University of North Dakota
In this work, we develop a machine learning (ML) algorithm for predicting the aqueous solubility of polymers and organic compounds. The solubility of a component in various solvents is an essential quantity for understanding the physical properties of a substance. With an accurate ML algorithm, more reliable equilibrium solubilities can be determined without time-consuming experimental and pilot studies by transforming conventional groups into a sequence of bits that computers can recognize. Conventional groups play an important role in calculating solubility values, and ML, used as a contribution method, can substantially improve how these values are calculated. Defining conventional groups as input data for an ML method requires structure-related information such as atoms, rings, bonds, and functional groups. To compare molecular structures, abstractions called molecular fingerprints or molecular descriptors are used. In this study, we trained an ML algorithm using both chemical descriptors and fingerprint methods. Molecular descriptors characterize a molecule’s physical, chemical, or topological properties. We used seventeen descriptors to predict aqueous solubility and compared their impact on the predicted output.
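As a rough illustration of what a molecular descriptor is, the sketch below computes three simple values from an element-count dictionary. The function name `simple_descriptors` and the choice of descriptors are assumptions made for this example; the abstract does not list the seventeen descriptors used in the study, and these are not claimed to be among them.

```python
# Standard atomic masses (g/mol) for a few common elements.
ATOMIC_MASS = {"H": 1.008, "C": 12.011, "N": 14.007, "O": 15.999}


def simple_descriptors(formula):
    """Compute three toy descriptors from an element-count dictionary.

    Hypothetical helper for illustration only; not the study's descriptor set.
    """
    mol_weight = sum(ATOMIC_MASS[el] * n for el, n in formula.items())
    heavy_atoms = sum(n for el, n in formula.items() if el != "H")
    hetero_atoms = sum(n for el, n in formula.items() if el not in ("C", "H"))
    return {
        "mol_weight": mol_weight,
        "heavy_atoms": heavy_atoms,
        # Fraction of heavy atoms that are not carbon (0.0 if no heavy atoms).
        "hetero_fraction": hetero_atoms / heavy_atoms if heavy_atoms else 0.0,
    }


# Ethanol, C2H6O, as a small worked example.
ethanol = simple_descriptors({"C": 2, "H": 6, "O": 1})
print(ethanol)
```

Each descriptor becomes one numeric column in the feature matrix passed to a regressor, which is what makes it possible to compare the descriptors' individual impact on the prediction.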

The molecular fingerprints used in this study fall into two categories: path-based and circular. Topological, or path-based, fingerprints encode combinations of atom types and the paths between them. Fragments of the molecule are generated by following a path up to a certain number of bonds. Path-based fingerprints hash all branched and linear molecular subgraphs up to a particular size by combining atom types, atomic number, and aromaticity state with bond types. The Daylight fingerprint is the best-known example of a path-based fingerprint, and the RDKit fingerprint is a close relative of it. In this study, a maximum path length of five (RDK5) was used.
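The path-enumeration idea can be sketched in plain Python as follows. This is a toy version under stated assumptions: it handles only linear paths (not the branched subgraphs that the RDKit fingerprint also covers), it uses a hand-written graph for ethanol, and the names `enumerate_paths` and `path_fingerprint`, the CRC32 hash, and the 64-bit length are all illustrative choices rather than the Daylight or RDKit implementation.

```python
import zlib

# Toy heavy-atom graph for ethanol (C-C-O); a hypothetical hand-built encoding.
ATOMS = ["C", "C", "O"]
BONDS = {(0, 1): "-", (1, 2): "-"}


def neighbors(bonds, i):
    """Yield (neighbor index, bond symbol) pairs for atom i."""
    for (a, b), symbol in bonds.items():
        if a == i:
            yield b, symbol
        elif b == i:
            yield a, symbol


def enumerate_paths(atoms, bonds, max_bonds):
    """Return strings for all linear paths containing 0..max_bonds bonds."""
    found = set()

    def walk(visited, tokens):
        # Store a direction-independent form so C-O and O-C count once.
        found.add(min(tuple(tokens), tuple(reversed(tokens))))
        if len(visited) - 1 == max_bonds:
            return
        for nxt, symbol in neighbors(bonds, visited[-1]):
            if nxt not in visited:
                walk(visited + [nxt], tokens + [symbol, atoms[nxt]])

    for start in range(len(atoms)):
        walk([start], [atoms[start]])
    return {"".join(t) for t in found}


def path_fingerprint(atoms, bonds, max_bonds=5, n_bits=64):
    """Hash each path to a bit position and set that bit."""
    bits = [0] * n_bits
    for path in enumerate_paths(atoms, bonds, max_bonds):
        bits[zlib.crc32(path.encode()) % n_bits] = 1
    return bits


paths = enumerate_paths(ATOMS, BONDS, max_bonds=5)
fingerprint = path_fingerprint(ATOMS, BONDS, max_bonds=5)
```

The `max_bonds=5` default mirrors the maximum path length of five mentioned above. Because distinct paths can hash to the same position, the number of set bits may be smaller than the number of paths.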

Circular fingerprints are generated by considering the “circular” environment of each atom up to a given “radius” or “diameter” from the central atom. The Morgan fingerprint, also known as the extended-connectivity fingerprint (ECFP), is the most popular circular fingerprint; it records the presence of specific circular substructures around each atom in a molecule. ECFPs build on a method for identifying identical molecules that have different atom numberings, representing each atom by the number of heavy-atom neighbors, the number of hydrogens, the isotope, and ring information. ECFPs come in different types depending on the maximum diameter, in bonds, of the circular atom neighborhood; the digit at the end of the name gives the maximum diameter used to generate the fingerprint. In this study, circular fingerprints with diameters of 4 and 6 (ECFP4 and ECFP6) were used.
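A minimal sketch of the circular, Morgan-style update is given below. It is a simplification made for illustration: atoms start from only two invariants (element and heavy-atom degree), duplicate substructures are not removed as they are in real ECFPs, and the names `stable_hash` and `circular_fingerprint`, the CRC32 hash, and the 64-bit length are assumptions. A radius of 2 corresponds to a diameter of 4 (as in ECFP4) and a radius of 3 to a diameter of 6 (as in ECFP6).

```python
import zlib

# Toy heavy-atom graph for ethanol (C-C-O); a hypothetical hand-built encoding.
ATOMS = ["C", "C", "O"]
ADJ = {0: [1], 1: [0, 2], 2: [1]}


def stable_hash(obj):
    """Deterministic hash of a small Python object via its repr."""
    return zlib.crc32(repr(obj).encode())


def circular_fingerprint(atoms, adj, radius=2, n_bits=64):
    """Toy Morgan-style fingerprint: grow each atom's environment by one bond per step."""
    n = len(atoms)
    # Iteration 0: identifiers from atom-local invariants only.
    ids = [stable_hash((atoms[i], len(adj[i]))) for i in range(n)]
    features = set(ids)
    for _ in range(radius):
        # Combine each atom's identifier with its sorted neighbor identifiers,
        # so the result does not depend on how atoms are numbered.
        ids = [
            stable_hash((ids[i], tuple(sorted(ids[j] for j in adj[i]))))
            for i in range(n)
        ]
        features.update(ids)
    bits = [0] * n_bits
    for feature in features:
        bits[feature % n_bits] = 1
    return bits


fp_d4 = circular_fingerprint(ATOMS, ADJ, radius=2)  # diameter 4
fp_d6 = circular_fingerprint(ATOMS, ADJ, radius=3)  # diameter 6

# The same molecule with its atoms numbered in the opposite order.
fp_renumbered = circular_fingerprint(["O", "C", "C"], {0: [1], 1: [0, 2], 2: [1]}, radius=2)
```

Sorting the neighbor identifiers before hashing is what makes the fingerprint independent of atom numbering, so `fp_d4` and `fp_renumbered` are identical.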

We trained Random Forest (RF) regressors on ~2000 organic and polymeric compounds, obtaining average test R² values of about 0.91, 0.81, and 0.82 for molecular descriptors, path-based fingerprints, and circular fingerprints, respectively. The most important features of each method and their impact on aqueous solubility were also investigated.
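For readers unfamiliar with the metric, the coefficient of determination R² reported above can be computed as one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below uses a hypothetical helper named `r_squared` and made-up numbers; none of the values come from the study.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Made-up values for illustration only (not data from the study).
measured = [1.0, 2.0, 3.0]
perfect_model = [1.0, 2.0, 3.0]
mean_only_model = [2.0, 2.0, 2.0]

score_perfect = r_squared(measured, perfect_model)
score_mean_only = r_squared(measured, mean_only_model)
```

A model that reproduces every measurement scores 1, while one that always predicts the mean of the measurements scores 0, which gives a scale for reading the reported test values.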